Commit 2d6f4c3
Parent(s): 46b1374
Add GGUF models + tokenizer with LFS

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
- .gitattributes +2 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_code.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_general.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_math.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_code.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_general.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_math.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +189 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +188 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt +11 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt +190 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt +190 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt +190 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt +0 -0
- Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.gguf filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
Benchmarks/granite-4.0-h-350m-unsloth-F16/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model                          |       size |     params | backend    | ngl |            test |                  t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+| granitehybrid 350M F16         | 649.80 MiB |   340.33 M | CUDA       |  35 |             pp8 |      1863.96 ± 66.78 |
+| granitehybrid 350M F16         | 649.80 MiB |   340.33 M | CUDA       |  35 |           tg128 |        305.52 ± 3.33 |
+
+build: 92bb442ad (7040)
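As a sanity check on the bench table above, the F16 file's bits-per-weight figure (reported later in the logs as 16.02 BPW) follows directly from the file size and parameter count shown here. A minimal sketch of that arithmetic, with the values copied from the log:

```python
def bits_per_weight(file_size_mib: float, n_params_m: float) -> float:
    """Bits per weight = total file size in bits / parameter count."""
    bits = file_size_mib * 1024 * 1024 * 8  # MiB -> bits
    return bits / (n_params_m * 1e6)        # params given in millions

# Values from the F16 row above: 649.80 MiB, 340.33 M params.
print(round(bits_per_weight(649.80, 340.33), 2))  # -> 16.02
```

The result slightly exceeds 16 bits because the file also stores metadata and the f32 norm tensors alongside the f16 weights.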
Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_code.txt
ADDED
@@ -0,0 +1,188 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21458 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/granite-4.0-h-350m-unsloth-F16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = granitehybrid
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
+llama_model_loader: - kv 3: general.finetune str = unsloth
+llama_model_loader: - kv 4: general.basename str = granite-4.0-h
+llama_model_loader: - kv 5: general.size_label str = 350M
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
+llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
+llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
+llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
+llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
+llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
+llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
+llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
+llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
+llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
+llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
+llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
+llama_model_loader: - kv 22: general.file_type u32 = 1
+llama_model_loader: - kv 23: granitehybrid.vocab_size u32 = 100352
+llama_model_loader: - kv 24: granitehybrid.rope.dimension_count u32 = 64
+llama_model_loader: - kv 25: granitehybrid.attention.scale f32 = 0.015625
+llama_model_loader: - kv 26: granitehybrid.embedding_scale f32 = 12.000000
+llama_model_loader: - kv 27: granitehybrid.residual_scale f32 = 0.246000
+llama_model_loader: - kv 28: granitehybrid.logit_scale f32 = 3.000000
+llama_model_loader: - kv 29: granitehybrid.expert_shared_feed_forward_length u32 = 2048
+llama_model_loader: - kv 30: granitehybrid.ssm.conv_kernel u32 = 4
+llama_model_loader: - kv 31: granitehybrid.ssm.state_size u32 = 128
+llama_model_loader: - kv 32: granitehybrid.ssm.group_count u32 = 1
+llama_model_loader: - kv 33: granitehybrid.ssm.inner_size u32 = 1536
+llama_model_loader: - kv 34: granitehybrid.ssm.time_step_rank u32 = 48
+llama_model_loader: - kv 35: granitehybrid.rope.scaling.finetuned bool = false
+llama_model_loader: - kv 36: general.quantization_version u32 = 2
+llama_model_loader: - kv 37: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 38: tokenizer.ggml.pre str = dbrx
+llama_model_loader: - kv 39: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 40: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 41: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 100257
+llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 100257
+llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 100269
+llama_model_loader: - kv 45: tokenizer.ggml.padding_token_id u32 = 100256
+llama_model_loader: - kv 46: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 47: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
+llama_model_loader: - type f32: 233 tensors
+llama_model_loader: - type f16: 169 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = F16
+print_info: file size = 649.80 MiB (16.02 BPW)
+load: printing all EOG tokens:
+load: - 100257 ('<|end_of_text|>')
+load: - 100261 ('<|fim_pad|>')
+load: special tokens cache size = 96
+load: token to piece cache size = 0.6152 MB
+print_info: arch = granitehybrid
+print_info: vocab_only = 0
+print_info: n_ctx_train = 1048576
+print_info: n_embd = 768
+print_info: n_embd_inp = 768
+print_info: n_layer = 32
+print_info: n_head = 12
+print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
+print_info: n_rot = 64
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 64
+print_info: n_embd_head_v = 64
+print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
+print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
+print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-05
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 3.0e+00
+print_info: f_attn_scale = 1.6e-02
+print_info: n_ff = 2048
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 0
+print_info: rope scaling = linear
+print_info: freq_base_train = 10000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 1048576
+print_info: rope_finetuned = unknown
+print_info: ssm_d_conv = 4
+print_info: ssm_d_inner = 1536
+print_info: ssm_d_state = 128
+print_info: ssm_dt_rank = 48
+print_info: ssm_n_group = 1
+print_info: ssm_dt_b_c_rms = 0
+print_info: model type = 350M
+print_info: model params = 340.33 M
+print_info: general.name = Granite 4.0 H 350m Unsloth
+print_info: f_embedding_scale = 12.000000
+print_info: f_residual_scale = 0.246000
+print_info: f_attention_scale = 0.015625
+print_info: n_ff_shexp = 2048
+print_info: vocab type = BPE
+print_info: n_vocab = 100352
+print_info: n_merges = 100000
+print_info: BOS token = 100257 '<|end_of_text|>'
+print_info: EOS token = 100257 '<|end_of_text|>'
+print_info: EOT token = 100257 '<|end_of_text|>'
+print_info: UNK token = 100269 '<|unk|>'
+print_info: PAD token = 100256 '<|pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 100258 '<|fim_prefix|>'
+print_info: FIM SUF token = 100260 '<|fim_suffix|>'
+print_info: FIM MID token = 100259 '<|fim_middle|>'
+print_info: FIM PAD token = 100261 '<|fim_pad|>'
+print_info: EOG token = 100257 '<|end_of_text|>'
+print_info: EOG token = 100261 '<|fim_pad|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/33 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 649.80 MiB
+load_tensors: CUDA0 model buffer size = 153.95 MiB
+load_tensors: CUDA1 model buffer size = 158.18 MiB
+...............................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 10000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.38 MiB
+llama_kv_cache: CPU KV buffer size = 2.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
+llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
+llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
+llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
+llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
+llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 352.02 MiB
+llama_context: CUDA1 compute buffer size = 22.39 MiB
+llama_context: CUDA_Host compute buffer size = 18.34 MiB
+llama_context: graph nodes = 1815
+llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
+common_init_from_params: added <|end_of_text|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 96.589 ms
+perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 0.52 seconds per pass - ETA 0.37 minutes
+[1]4.3688,[2]3.9755,[3]2.5656,[4]2.3655,[5]2.6049,[6]2.8500,[7]2.7007,[8]2.5086,[9]2.3055,[10]2.1367,[11]2.1192,[12]2.1455,[13]2.0570,[14]2.0369,[15]2.0775,[16]2.0120,[17]1.9865,[18]2.0048,[19]1.9652,[20]1.9300,[21]1.8972,[22]1.8825,[23]1.9116,[24]1.8852,[25]1.9041,[26]1.8721,[27]1.8590,[28]1.8508,[29]1.8964,[30]1.9127,[31]1.9120,[32]1.8878,[33]1.9115,[34]1.9038,[35]1.8852,[36]1.9163,[37]1.9228,[38]1.9205,[39]1.9423,[40]1.9397,[41]1.9327,[42]1.9567,[43]1.9654,[44]1.9547,
+Final estimate: PPL = 1.9547 +/- 0.01753
+
+llama_perf_context_print: load time = 311.22 ms
+llama_perf_context_print: prompt eval time = 15790.91 ms / 90112 tokens ( 0.18 ms per token, 5706.57 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 16587.55 ms / 90113 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20796 + ( 517 = 153 + 10 + 353) + 2793 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23312 + ( 189 = 158 + 8 + 22) + 622 |
+llama_memory_breakdown_print: | - Host | 678 = 649 + 10 + 18 |
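The running `[1]…,[2]…` series and the final `PPL = 1.9547 +/- 0.01753` in the perplexity log above are derived from per-chunk negative log-likelihoods: PPL is exp of the mean NLL, and the quoted error band is a delta-method standard error. A minimal sketch of that arithmetic (the NLL values here are synthetic, for illustration only):

```python
import math

def running_perplexity(nlls):
    """Running PPL after each chunk: exp(mean NLL over chunks seen so far)."""
    out, total = [], 0.0
    for i, nll in enumerate(nlls, 1):
        total += nll
        out.append(math.exp(total / i))
    return out

def ppl_with_stderr(nlls):
    """Final PPL plus a delta-method standard error:
    stderr(PPL) ~= PPL * std(NLL) / sqrt(n)."""
    n = len(nlls)
    mean = sum(nlls) / n
    var = sum((x - mean) ** 2 for x in nlls) / n
    ppl = math.exp(mean)
    return ppl, ppl * math.sqrt(var / n)

# Synthetic per-chunk NLLs, standing in for the 44 real chunk values.
nlls = [1.475, 1.297, 0.571, 0.749]
print([round(p, 4) for p in running_perplexity(nlls)])
print(ppl_with_stderr(nlls))
```

In the real run each "chunk" is one 2048-token window, and the NLL is averaged per token within the chunk before being pooled this way.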
Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_general.txt
ADDED
@@ -0,0 +1,188 @@
[Lines 1–172 are identical to the model-load, context-setup, and system_info output shown in perplexity_code.txt above.]
+perplexity: tokenizing the input ..
+perplexity: tokenization took 42.35 ms
+perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 0.52 seconds per pass - ETA 0.12 minutes
+[1]18.5629,[2]21.5554,[3]22.2031,[4]20.1836,[5]20.1858,[6]18.0109,[7]17.6304,[8]17.5848,[9]18.0830,[10]18.0578,[11]17.9015,[12]18.0183,[13]18.0815,[14]18.1241,
+Final estimate: PPL = 18.1241 +/- 0.46538
+
+llama_perf_context_print: load time = 284.72 ms
+llama_perf_context_print: prompt eval time = 5102.84 ms / 28672 tokens ( 0.18 ms per token, 5618.83 tokens per second)
|
| 182 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 183 |
+
llama_perf_context_print: total time = 5376.12 ms / 28673 tokens
|
| 184 |
+
llama_perf_context_print: graphs reused = 0
|
| 185 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 186 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20796 + ( 517 = 153 + 10 + 353) + 2793 |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23312 + ( 189 = 158 + 8 + 22) + 622 |
|
| 188 |
+
llama_memory_breakdown_print: | - Host | 678 = 649 + 10 + 18 |
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/perplexity_math.txt
ADDED
@@ -0,0 +1,188 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21458 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/granite-4.0-h-350m-unsloth-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: general.file_type u32 = 1
llama_model_loader: - kv 23: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 24: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 25: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 26: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 27: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 28: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 29: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 30: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 31: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 32: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 33: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 34: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 35: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 36: general.quantization_version u32 = 2
llama_model_loader: - kv 37: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 38: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 39: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 40: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 41: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 44: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 45: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 46: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 47: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type f16: 169 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 649.80 MiB (16.02 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 649.80 MiB
load_tensors: CUDA0 model buffer size = 153.95 MiB
load_tensors: CUDA1 model buffer size = 158.18 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 352.02 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 34.235 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.52 seconds per pass - ETA 0.12 minutes
[1]8.6779,[2]9.8888,[3]9.4631,[4]9.8086,[5]9.9478,[6]10.0321,[7]10.1862,[8]9.8834,[9]9.9359,[10]9.9483,[11]10.1876,[12]10.2754,[13]10.3986,[14]10.3756,[15]10.2753,
Final estimate: PPL = 10.2753 +/- 0.23118

llama_perf_context_print: load time = 282.83 ms
llama_perf_context_print: prompt eval time = 5442.99 ms / 30720 tokens ( 0.18 ms per token, 5643.95 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 5713.68 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20796 + ( 517 = 153 + 10 + 353) + 2793 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23312 + ( 189 = 158 + 8 + 22) + 622 |
llama_memory_breakdown_print: | - Host | 678 = 649 + 10 + 18 |
|
Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/granite-4.0-h-350m-unsloth-F16/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 350M MXFP4 MoE | 416.84 MiB | 340.33 M | CUDA | 35 | pp8 | 1580.44 ± 52.60 |
| granitehybrid 350M MXFP4 MoE | 416.84 MiB | 340.33 M | CUDA | 35 | tg128 | 317.44 ± 10.99 |

build: 92bb442ad (7040)
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_code.txt
ADDED
@@ -0,0 +1,189 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21550 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type f16: 5 tensors
llama_model_loader: - type q8_0: 164 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 416.84 MiB (10.27 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 249.06 MiB
load_tensors: CUDA0 model buffer size = 83.03 MiB
load_tensors: CUDA1 model buffer size = 84.77 MiB
..................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 344.50 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 86.825 ms
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.52 seconds per pass - ETA 0.37 minutes
[1]4.3698,[2]3.9753,[3]2.5675,[4]2.3676,[5]2.6020,[6]2.8471,[7]2.6975,[8]2.5061,[9]2.3039,[10]2.1360,[11]2.1187,[12]2.1454,[13]2.0570,[14]2.0370,[15]2.0772,[16]2.0117,[17]1.9868,[18]2.0053,[19]1.9660,[20]1.9307,[21]1.8977,[22]1.8830,[23]1.9123,[24]1.8859,[25]1.9048,[26]1.8728,[27]1.8595,[28]1.8512,[29]1.8969,[30]1.9134,[31]1.9125,[32]1.8882,[33]1.9117,[34]1.9038,[35]1.8851,[36]1.9164,[37]1.9228,[38]1.9208,[39]1.9424,[40]1.9398,[41]1.9326,[42]1.9567,[43]1.9653,[44]1.9546,
Final estimate: PPL = 1.9546 +/- 0.01751

llama_perf_context_print: load time = 216.12 ms
llama_perf_context_print: prompt eval time = 14903.60 ms / 90112 tokens ( 0.17 ms per token, 6046.32 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 15698.28 ms / 90113 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20966 + ( 437 = 83 + 10 + 344) + 2703 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23386 + ( 116 = 84 + 8 + 22) + 621 |
llama_memory_breakdown_print: | - Host | 277 = 249 + 10 + 18 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_general.txt
ADDED
@@ -0,0 +1,189 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21551 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type f16: 5 tensors
llama_model_loader: - type q8_0: 164 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 416.84 MiB (10.27 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 249.06 MiB
load_tensors: CUDA0 model buffer size = 83.03 MiB
load_tensors: CUDA1 model buffer size = 84.77 MiB
..................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 344.50 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 36.679 ms
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.53 seconds per pass - ETA 0.12 minutes
[1]18.6129,[2]21.6640,[3]22.2947,[4]20.2218,[5]20.2095,[6]18.0178,[7]17.6488,[8]17.6025,[9]18.1184,[10]18.0967,[11]17.9390,[12]18.0537,[13]18.1240,[14]18.1603,
Final estimate: PPL = 18.1603 +/- 0.46663

llama_perf_context_print: load time = 349.90 ms
llama_perf_context_print: prompt eval time = 4858.50 ms / 28672 tokens ( 0.17 ms per token, 5901.41 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 5120.75 ms / 28673 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20967 + ( 437 = 83 + 10 + 344) + 2702 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23386 + ( 116 = 84 + 8 + 22) + 621 |
llama_memory_breakdown_print: | - Host | 277 = 249 + 10 + 18 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/perplexity_math.txt
ADDED
@@ -0,0 +1,189 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21550 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type f16: 5 tensors
llama_model_loader: - type q8_0: 164 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 416.84 MiB (10.27 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 249.06 MiB
load_tensors: CUDA0 model buffer size = 83.03 MiB
load_tensors: CUDA1 model buffer size = 84.77 MiB
..................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 344.50 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 34.666 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.52 seconds per pass - ETA 0.12 minutes
[1]8.6910,[2]9.9161,[3]9.4919,[4]9.8295,[5]9.9773,[6]10.0549,[7]10.2126,[8]9.9086,[9]9.9654,[10]9.9737,[11]10.2065,[12]10.2935,[13]10.4185,[14]10.3902,[15]10.2923,
Final estimate: PPL = 10.2923 +/- 0.23173

llama_perf_context_print: load time = 214.44 ms
llama_perf_context_print: prompt eval time = 5171.34 ms / 30720 tokens ( 0.17 ms per token, 5940.43 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 5441.07 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 20966 + ( 437 = 83 + 10 + 344) + 2703 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23386 + ( 116 = 84 + 8 + 22) + 621 |
llama_memory_breakdown_print: | - Host | 277 = 249 + 10 + 18 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 350M MXFP4 MoE | 307.95 MiB | 340.33 M | CUDA | 35 | pp8 | 1743.34 ± 32.04 |
| granitehybrid 350M MXFP4 MoE | 307.95 MiB | 340.33 M | CUDA | 35 | tg128 | 330.82 ± 5.17 |

build: 92bb442ad (7040)
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_code.txt
ADDED
@@ -0,0 +1,189 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21550 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 350M
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 233 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 164 tensors
|
| 61 |
+
llama_model_loader: - type q4_K: 5 tensors
|
| 62 |
+
print_info: file format = GGUF V3 (latest)
|
| 63 |
+
print_info: file type = MXFP4 MoE
|
| 64 |
+
print_info: file size = 307.95 MiB (7.59 BPW)
|
| 65 |
+
load: printing all EOG tokens:
|
| 66 |
+
load: - 100257 ('<|end_of_text|>')
|
| 67 |
+
load: - 100261 ('<|fim_pad|>')
|
| 68 |
+
load: special tokens cache size = 96
|
| 69 |
+
load: token to piece cache size = 0.6152 MB
|
| 70 |
+
print_info: arch = granitehybrid
|
| 71 |
+
print_info: vocab_only = 0
|
| 72 |
+
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 768
|
| 74 |
+
print_info: n_embd_inp = 768
|
| 75 |
+
print_info: n_layer = 32
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 64
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 64
|
| 82 |
+
print_info: n_embd_head_v = 64
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 3.0e+00
|
| 91 |
+
print_info: f_attn_scale = 1.6e-02
|
| 92 |
+
print_info: n_ff = 2048
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 1536
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 350M
|
| 112 |
+
print_info: model params = 340.33 M
|
| 113 |
+
print_info: general.name = Granite 4.0 H 350m Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.246000
|
| 116 |
+
print_info: f_attention_scale = 0.015625
|
| 117 |
+
print_info: n_ff_shexp = 2048
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/33 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 142.60 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 81.42 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 83.97 MiB
|
| 140 |
+
........................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 2.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
|
| 157 |
+
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
|
| 161 |
+
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 245.96 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 22.39 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 18.34 MiB
|
| 166 |
+
llama_context: graph nodes = 1815
|
| 167 |
+
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 94.463 ms
|
| 176 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 0.46 seconds per pass - ETA 0.33 minutes
|
| 178 |
+
[1]4.5298,[2]4.0979,[3]2.6201,[4]2.4043,[5]2.6573,[6]2.9213,[7]2.7701,[8]2.5713,[9]2.3585,[10]2.1818,[11]2.1630,[12]2.1895,[13]2.0967,[14]2.0750,[15]2.1147,[16]2.0460,[17]2.0188,[18]2.0393,[19]1.9987,[20]1.9623,[21]1.9281,[22]1.9126,[23]1.9435,[24]1.9164,[25]1.9363,[26]1.9036,[27]1.8905,[28]1.8814,[29]1.9284,[30]1.9465,[31]1.9455,[32]1.9207,[33]1.9442,[34]1.9363,[35]1.9170,[36]1.9503,[37]1.9560,[38]1.9544,[39]1.9767,[40]1.9753,[41]1.9684,[42]1.9942,[43]2.0031,[44]1.9920,
|
| 179 |
+
Final estimate: PPL = 1.9920 +/- 0.01815
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 200.82 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 14017.94 ms / 90112 tokens ( 0.16 ms per token, 6428.33 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 14780.16 ms / 90113 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21138 + ( 337 = 81 + 10 + 245) + 2630 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23400 + ( 115 = 83 + 8 + 22) + 608 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 171 = 142 + 10 + 18 |
|
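The bracketed values [1]…[44] in the perplexity logs are cumulative estimates: after each chunk, the tool reports exp of the mean negative log-likelihood over all tokens evaluated so far, which is why the last value matches the "Final estimate" line. A toy sketch of that relationship (the NLL numbers below are illustrative, not taken from the run above):

```python
import math

def running_perplexity(chunk_nlls):
    """Cumulative perplexity after each chunk: exp of the running mean NLL."""
    out, total, count = [], 0.0, 0
    for nlls in chunk_nlls:
        total += sum(nlls)       # accumulate per-token negative log-likelihoods
        count += len(nlls)
        out.append(math.exp(total / count))
    return out

# toy per-token NLLs for three chunks
chunks = [[1.2, 0.8], [0.5, 0.7], [0.6, 0.6]]
series = running_perplexity(chunks)
print(series)  # the last entry is the "final estimate" for this toy data
```

This is also why early chunks swing widely (4.53 on chunk 1 of the code run) while the series settles as more tokens enter the mean.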
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_general.txt
ADDED
@@ -0,0 +1,189 @@
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21550 MiB free
+llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = granitehybrid
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
+llama_model_loader: - kv 3: general.finetune str = unsloth
+llama_model_loader: - kv 4: general.basename str = granite-4.0-h
+llama_model_loader: - kv 5: general.size_label str = 350M
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
+llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
+llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
+llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
+llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
+llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
+llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
+llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
+llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
+llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
+llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
+llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
+llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
+llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
+llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
+llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
+llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
+llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
+llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
+llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
+llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
+llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
+llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
+llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
+llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
+llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
+llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
+llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
+llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
+llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
+llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
+llama_model_loader: - kv 46: general.quantization_version u32 = 2
+llama_model_loader: - kv 47: general.file_type u32 = 38
+llama_model_loader: - type f32: 233 tensors
+llama_model_loader: - type q8_0: 164 tensors
+llama_model_loader: - type q4_K: 5 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = MXFP4 MoE
+print_info: file size = 307.95 MiB (7.59 BPW)
+load: printing all EOG tokens:
+load: - 100257 ('<|end_of_text|>')
+load: - 100261 ('<|fim_pad|>')
+load: special tokens cache size = 96
+load: token to piece cache size = 0.6152 MB
+print_info: arch = granitehybrid
+print_info: vocab_only = 0
+print_info: n_ctx_train = 1048576
+print_info: n_embd = 768
+print_info: n_embd_inp = 768
+print_info: n_layer = 32
+print_info: n_head = 12
+print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
+print_info: n_rot = 64
+print_info: n_swa = 0
+print_info: is_swa_any = 0
+print_info: n_embd_head_k = 64
+print_info: n_embd_head_v = 64
+print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
+print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
+print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-05
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 3.0e+00
+print_info: f_attn_scale = 1.6e-02
+print_info: n_ff = 2048
+print_info: n_expert = 0
+print_info: n_expert_used = 0
+print_info: n_expert_groups = 0
+print_info: n_group_used = 0
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 0
+print_info: rope scaling = linear
+print_info: freq_base_train = 10000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 1048576
+print_info: rope_finetuned = unknown
+print_info: ssm_d_conv = 4
+print_info: ssm_d_inner = 1536
+print_info: ssm_d_state = 128
+print_info: ssm_dt_rank = 48
+print_info: ssm_n_group = 1
+print_info: ssm_dt_b_c_rms = 0
+print_info: model type = 350M
+print_info: model params = 340.33 M
+print_info: general.name = Granite 4.0 H 350m Unsloth
+print_info: f_embedding_scale = 12.000000
+print_info: f_residual_scale = 0.246000
+print_info: f_attention_scale = 0.015625
+print_info: n_ff_shexp = 2048
+print_info: vocab type = BPE
+print_info: n_vocab = 100352
+print_info: n_merges = 100000
+print_info: BOS token = 100257 '<|end_of_text|>'
+print_info: EOS token = 100257 '<|end_of_text|>'
+print_info: EOT token = 100257 '<|end_of_text|>'
+print_info: UNK token = 100269 '<|unk|>'
+print_info: PAD token = 100256 '<|pad|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 100258 '<|fim_prefix|>'
+print_info: FIM SUF token = 100260 '<|fim_suffix|>'
+print_info: FIM MID token = 100259 '<|fim_middle|>'
+print_info: FIM PAD token = 100261 '<|fim_pad|>'
+print_info: EOG token = 100257 '<|end_of_text|>'
+print_info: EOG token = 100261 '<|fim_pad|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 20 repeating layers to GPU
+load_tensors: offloaded 20/33 layers to GPU
+load_tensors: CPU_Mapped model buffer size = 142.60 MiB
+load_tensors: CUDA0 model buffer size = 81.42 MiB
+load_tensors: CUDA1 model buffer size = 83.97 MiB
+........................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 2048
+llama_context: n_ctx_seq = 2048
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = auto
+llama_context: kv_unified = false
+llama_context: freq_base = 10000.0
+llama_context: freq_scale = 1
+llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
+llama_context: CPU output buffer size = 0.38 MiB
+llama_kv_cache: CPU KV buffer size = 2.00 MiB
+llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
+llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
+llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
+llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
+llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
+llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
+llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
+llama_context: Flash Attention was auto, set to enabled
+llama_context: CUDA0 compute buffer size = 245.96 MiB
+llama_context: CUDA1 compute buffer size = 22.39 MiB
+llama_context: CUDA_Host compute buffer size = 18.34 MiB
+llama_context: graph nodes = 1815
+llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
+common_init_from_params: added <|end_of_text|> logit bias = -inf
+common_init_from_params: added <|fim_pad|> logit bias = -inf
+common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 37.278 ms
+perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 0.46 seconds per pass - ETA 0.10 minutes
+[1]19.6323,[2]22.7740,[3]23.5697,[4]21.3162,[5]21.3551,[6]18.9637,[7]18.5819,[8]18.5233,[9]19.1358,[10]19.0958,[11]18.9245,[12]19.0565,[13]19.0987,[14]19.1505,
+Final estimate: PPL = 19.1505 +/- 0.49516
+
+llama_perf_context_print: load time = 201.36 ms
+llama_perf_context_print: prompt eval time = 4615.17 ms / 28672 tokens ( 0.16 ms per token, 6212.55 tokens per second)
+llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_perf_context_print: total time = 4867.74 ms / 28673 tokens
+llama_perf_context_print: graphs reused = 0
+llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21146 + ( 337 = 81 + 10 + 245) + 2623 |
+llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23400 + ( 115 = 83 + 8 + 22) + 608 |
+llama_memory_breakdown_print: | - Host | 171 = 142 + 10 + 18 |
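The "7.59 BPW" in `print_info: file size = 307.95 MiB (7.59 BPW)` is just the file size in bits divided by the parameter count, which makes it easy to sanity-check any of these quantizations from the logged size and params. A small sketch of that arithmetic:

```python
def bits_per_weight(size_mib: float, params_m: float) -> float:
    """Bits per weight: file size in bits over parameter count.

    size_mib: file size in MiB as printed by llama.cpp
    params_m: parameter count in millions (e.g. 340.33 for "340.33 M")
    """
    return size_mib * 1024 * 1024 * 8 / (params_m * 1e6)

# values from the logs above for the MXFP4_MOE-Q4_K file
bpw = bits_per_weight(307.95, 340.33)
print(round(bpw, 2))  # matches the 7.59 BPW llama.cpp reports
```

The BPW is well above 4 here because, per the tensor summary, most tensors in this file are q8_0 or f32; only 5 tensors are q4_K.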
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/perplexity_math.txt
ADDED
|
@@ -0,0 +1,189 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21542 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q4_K: 5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 307.95 MiB (7.59 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 142.60 MiB
load_tensors: CUDA0 model buffer size = 81.42 MiB
load_tensors: CUDA1 model buffer size = 83.97 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 245.96 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 33.747 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.47 seconds per pass - ETA 0.12 minutes
[1]9.3665,[2]10.6437,[3]10.0524,[4]10.4644,[5]10.6332,[6]10.6906,[7]10.8464,[8]10.5098,[9]10.5672,[10]10.5575,[11]10.8126,[12]10.9208,[13]11.0423,[14]11.0198,[15]10.9094,
Final estimate: PPL = 10.9094 +/- 0.24996

llama_perf_context_print: load time = 203.94 ms
llama_perf_context_print: prompt eval time = 4804.48 ms / 30720 tokens ( 0.16 ms per token, 6394.03 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 5061.36 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21138 + ( 337 = 81 + 10 + 245) + 2630 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23400 + ( 115 = 83 + 8 + 22) + 608 |
llama_memory_breakdown_print: | - Host | 171 = 142 + 10 + 18 |
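As context for the `Final estimate` lines in these perplexity logs: llama.cpp reports perplexity as the exponential of the mean per-token negative log-likelihood over all evaluated chunks, and the bracketed per-chunk values are a running estimate of the same quantity. A minimal sketch (the NLL values below are hypothetical, not taken from this run):

```python
import math

def perplexity(nlls):
    # PPL = exp(mean per-token negative log-likelihood).
    # llama-perplexity prints this running estimate after each 2048-token
    # chunk, which is why the bracketed values converge to the final one.
    return math.exp(sum(nlls) / len(nlls))

# hypothetical per-token NLLs, for illustration only
print(round(perplexity([2.3, 2.4, 2.5]), 4))
```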
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_code.txt
ADDED (diff too large to render; see raw diff)

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_general.txt
ADDED (diff too large to render; see raw diff)

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q4_K/ppl_corpus_math.txt
ADDED (diff too large to render; see raw diff)
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/llamabench.txt
ADDED
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 350M MXFP4 MoE | 317.42 MiB | 340.33 M | CUDA | 35 | pp8 | 1742.99 ± 26.00 |
| granitehybrid 350M MXFP4 MoE | 317.42 MiB | 340.33 M | CUDA | 35 | tg128 | 331.05 ± 6.51 |

build: 92bb442ad (7040)
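For reading the llama-bench table above: `pp8` is prompt processing with an 8-token prompt and `tg128` is generation of 128 tokens, both reported as throughput in tokens per second. Throughput and per-token latency are reciprocals; a quick sketch (the helper name is ours, not part of llama.cpp):

```python
def ms_per_token(tokens_per_second):
    # convert a t/s figure from the llama-bench table into per-token latency
    return 1000.0 / tokens_per_second

# e.g. the two rows above: ~0.57 ms/token for pp8, ~3.02 ms/token for tg128
print(round(ms_per_token(1742.99), 2))
print(round(ms_per_token(331.05), 2))
```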
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_code.txt
ADDED
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21542 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q5_K: 5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 317.42 MiB (7.82 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 151.86 MiB
load_tensors: CUDA0 model buffer size = 81.56 MiB
load_tensors: CUDA1 model buffer size = 84.04 MiB
......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 255.15 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 90.64 ms
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.49 seconds per pass - ETA 0.35 minutes
[1]4.4294,[2]4.0259,[3]2.5919,[4]2.3810,[5]2.6241,[6]2.8771,[7]2.7244,[8]2.5295,[9]2.3227,[10]2.1512,[11]2.1334,[12]2.1598,[13]2.0705,[14]2.0490,[15]2.0916,[16]2.0251,[17]1.9995,[18]2.0185,[19]1.9787,[20]1.9438,[21]1.9094,[22]1.8943,[23]1.9239,[24]1.8973,[25]1.9167,[26]1.8843,[27]1.8706,[28]1.8621,[29]1.9080,[30]1.9243,[31]1.9234,[32]1.8988,[33]1.9229,[34]1.9149,[35]1.8960,[36]1.9278,[37]1.9340,[38]1.9318,[39]1.9539,[40]1.9512,[41]1.9443,[42]1.9689,[43]1.9774,[44]1.9665,
Final estimate: PPL = 1.9665 +/- 0.01775

llama_perf_context_print: load time = 199.76 ms
llama_perf_context_print: prompt eval time = 14121.89 ms / 90112 tokens ( 0.16 ms per token, 6381.02 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 14870.27 ms / 90113 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21128 + ( 346 = 81 + 10 + 255) + 2631 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
llama_memory_breakdown_print: | - Host | 180 = 151 + 10 + 18 |
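The BPW figure in the `print_info: file size` line can be reproduced from the reported file size and parameter count; llama.cpp derives bits-per-weight from the whole file, so GGUF metadata and the unquantized f32 tensors are included in the figure. A sketch:

```python
def bits_per_weight(file_size_mib, n_params):
    # total bits in the GGUF file divided by the model's parameter count
    return file_size_mib * 1024 * 1024 * 8 / n_params

print(round(bits_per_weight(317.42, 340.33e6), 2))  # the Q5_K variant above: 7.82
print(round(bits_per_weight(307.95, 340.33e6), 2))  # the Q4_K variant earlier: 7.59
```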
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_general.txt
ADDED
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21542 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q5_K: 5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 317.42 MiB (7.82 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
|
| 73 |
+
print_info: n_embd = 768
|
| 74 |
+
print_info: n_embd_inp = 768
|
| 75 |
+
print_info: n_layer = 32
|
| 76 |
+
print_info: n_head = 12
|
| 77 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 78 |
+
print_info: n_rot = 64
|
| 79 |
+
print_info: n_swa = 0
|
| 80 |
+
print_info: is_swa_any = 0
|
| 81 |
+
print_info: n_embd_head_k = 64
|
| 82 |
+
print_info: n_embd_head_v = 64
|
| 83 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
|
| 86 |
+
print_info: f_norm_eps = 0.0e+00
|
| 87 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 88 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 89 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 90 |
+
print_info: f_logit_scale = 3.0e+00
|
| 91 |
+
print_info: f_attn_scale = 1.6e-02
|
| 92 |
+
print_info: n_ff = 2048
|
| 93 |
+
print_info: n_expert = 0
|
| 94 |
+
print_info: n_expert_used = 0
|
| 95 |
+
print_info: n_expert_groups = 0
|
| 96 |
+
print_info: n_group_used = 0
|
| 97 |
+
print_info: causal attn = 1
|
| 98 |
+
print_info: pooling type = 0
|
| 99 |
+
print_info: rope type = 0
|
| 100 |
+
print_info: rope scaling = linear
|
| 101 |
+
print_info: freq_base_train = 10000.0
|
| 102 |
+
print_info: freq_scale_train = 1
|
| 103 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 104 |
+
print_info: rope_finetuned = unknown
|
| 105 |
+
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 1536
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 350M
|
| 112 |
+
print_info: model params = 340.33 M
|
| 113 |
+
print_info: general.name = Granite 4.0 H 350m Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.246000
|
| 116 |
+
print_info: f_attention_scale = 0.015625
|
| 117 |
+
print_info: n_ff_shexp = 2048
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/33 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 151.86 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 81.56 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 84.04 MiB
|
| 140 |
+
......................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 2.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
|
| 157 |
+
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
|
| 161 |
+
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 255.15 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 22.39 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 18.34 MiB
|
| 166 |
+
llama_context: graph nodes = 1815
|
| 167 |
+
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 37.83 ms
|
| 176 |
+
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 0.47 seconds per pass - ETA 0.10 minutes
|
| 178 |
+
[1]19.3339,[2]22.5132,[3]23.2334,[4]20.9767,[5]20.9869,[6]18.6407,[7]18.2091,[8]18.1540,[9]18.6822,[10]18.6766,[11]18.5038,[12]18.6767,[13]18.7570,[14]18.8193,
|
| 179 |
+
Final estimate: PPL = 18.8193 +/- 0.48637
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 200.97 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 4583.55 ms / 28672 tokens ( 0.16 ms per token, 6255.42 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 4828.83 ms / 28673 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21128 + ( 346 = 81 + 10 + 255) + 2631 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 180 = 151 + 10 + 18 |
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/perplexity_math.txt
ADDED
@@ -0,0 +1,189 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21542 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q5_K: 5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 317.42 MiB (7.82 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 151.86 MiB
load_tensors: CUDA0 model buffer size = 81.56 MiB
load_tensors: CUDA1 model buffer size = 84.04 MiB
......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 255.15 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 34.842 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.49 seconds per pass - ETA 0.12 minutes
[1]8.8965,[2]10.1280,[3]9.6975,[4]10.0503,[5]10.1866,[6]10.2437,[7]10.3940,[8]10.0949,[9]10.1551,[10]10.1600,[11]10.4038,[12]10.4920,[13]10.6155,[14]10.5893,[15]10.4795,
Final estimate: PPL = 10.4795 +/- 0.23665

llama_perf_context_print: load time = 203.42 ms
llama_perf_context_print: prompt eval time = 4898.40 ms / 30720 tokens ( 0.16 ms per token, 6271.44 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 5157.31 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21128 + ( 346 = 81 + 10 + 255) + 2631 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
llama_memory_breakdown_print: | - Host | 180 = 151 + 10 + 18 |
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff.

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff.

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q5_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff.

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 350M MXFP4 MoE | 327.48 MiB | 340.33 M | CUDA | 35 | pp8 | 1741.83 ± 26.78 |
| granitehybrid 350M MXFP4 MoE | 327.48 MiB | 340.33 M | CUDA | 35 | tg128 | 328.06 ± 4.18 |

build: 92bb442ad (7040)
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt
ADDED
@@ -0,0 +1,189 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21541 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q6_K: 5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 327.48 MiB (8.07 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
|
| 106 |
+
print_info: ssm_d_inner = 1536
|
| 107 |
+
print_info: ssm_d_state = 128
|
| 108 |
+
print_info: ssm_dt_rank = 48
|
| 109 |
+
print_info: ssm_n_group = 1
|
| 110 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 111 |
+
print_info: model type = 350M
|
| 112 |
+
print_info: model params = 340.33 M
|
| 113 |
+
print_info: general.name = Granite 4.0 H 350m Unsloth
|
| 114 |
+
print_info: f_embedding_scale = 12.000000
|
| 115 |
+
print_info: f_residual_scale = 0.246000
|
| 116 |
+
print_info: f_attention_scale = 0.015625
|
| 117 |
+
print_info: n_ff_shexp = 2048
|
| 118 |
+
print_info: vocab type = BPE
|
| 119 |
+
print_info: n_vocab = 100352
|
| 120 |
+
print_info: n_merges = 100000
|
| 121 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 122 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 125 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 126 |
+
print_info: LF token = 198 'Ċ'
|
| 127 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 128 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 129 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 130 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 131 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 132 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 133 |
+
print_info: max token length = 256
|
| 134 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 135 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 136 |
+
load_tensors: offloaded 20/33 layers to GPU
|
| 137 |
+
load_tensors: CPU_Mapped model buffer size = 161.69 MiB
|
| 138 |
+
load_tensors: CUDA0 model buffer size = 81.71 MiB
|
| 139 |
+
load_tensors: CUDA1 model buffer size = 84.11 MiB
|
| 140 |
+
...................................................................................
|
| 141 |
+
llama_context: constructing llama_context
|
| 142 |
+
llama_context: n_seq_max = 1
|
| 143 |
+
llama_context: n_ctx = 2048
|
| 144 |
+
llama_context: n_ctx_seq = 2048
|
| 145 |
+
llama_context: n_batch = 2048
|
| 146 |
+
llama_context: n_ubatch = 512
|
| 147 |
+
llama_context: causal_attn = 1
|
| 148 |
+
llama_context: flash_attn = auto
|
| 149 |
+
llama_context: kv_unified = false
|
| 150 |
+
llama_context: freq_base = 10000.0
|
| 151 |
+
llama_context: freq_scale = 1
|
| 152 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
|
| 153 |
+
llama_context: CPU output buffer size = 0.38 MiB
|
| 154 |
+
llama_kv_cache: CPU KV buffer size = 2.00 MiB
|
| 155 |
+
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
|
| 156 |
+
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
|
| 157 |
+
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
|
| 161 |
+
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 264.91 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 22.39 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 18.34 MiB
|
| 166 |
+
llama_context: graph nodes = 1815
|
| 167 |
+
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 87.31 ms
|
| 176 |
+
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 0.48 seconds per pass - ETA 0.33 minutes
|
| 178 |
+
[1]4.3924,[2]3.9909,[3]2.5740,[4]2.3730,[5]2.6140,[6]2.8599,[7]2.7103,[8]2.5176,[9]2.3127,[10]2.1429,[11]2.1251,[12]2.1520,[13]2.0628,[14]2.0428,[15]2.0830,[16]2.0172,[17]1.9919,[18]2.0105,[19]1.9704,[20]1.9350,[21]1.9022,[22]1.8872,[23]1.9169,[24]1.8904,[25]1.9094,[26]1.8772,[27]1.8638,[28]1.8555,[29]1.9011,[30]1.9178,[31]1.9167,[32]1.8923,[33]1.9159,[34]1.9080,[35]1.8891,[36]1.9205,[37]1.9268,[38]1.9248,[39]1.9465,[40]1.9438,[41]1.9364,[42]1.9605,[43]1.9691,[44]1.9583,
|
| 179 |
+
Final estimate: PPL = 1.9583 +/- 0.01755
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 205.79 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 14154.99 ms / 90112 tokens ( 0.16 ms per token, 6366.10 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 14895.22 ms / 90113 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21123 + ( 356 = 81 + 10 + 264) + 2627 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 190 = 161 + 10 + 18 |
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt
ADDED
@@ -0,0 +1,189 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21541 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q6_K: 5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 327.48 MiB (8.07 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 161.69 MiB
load_tensors: CUDA0 model buffer size = 81.71 MiB
load_tensors: CUDA1 model buffer size = 84.11 MiB
...................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 264.91 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 37.397 ms
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.48 seconds per pass - ETA 0.10 minutes
[1]18.7156,[2]21.6973,[3]22.4044,[4]20.2801,[5]20.3099,[6]18.1309,[7]17.7535,[8]17.7048,[9]18.2308,[10]18.1990,[11]18.0260,[12]18.1323,[13]18.1933,[14]18.2289,
Final estimate: PPL = 18.2289 +/- 0.46969

llama_perf_context_print: load time = 200.73 ms
llama_perf_context_print: prompt eval time = 4623.61 ms / 28672 tokens ( 0.16 ms per token, 6201.22 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 4883.49 ms / 28673 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21117 + ( 356 = 81 + 10 + 264) + 2632 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
llama_memory_breakdown_print: | - Host | 190 = 161 + 10 + 18 |
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt
ADDED
@@ -0,0 +1,189 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21547 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q6_K: 5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 327.48 MiB (8.07 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 161.69 MiB
load_tensors: CUDA0 model buffer size = 81.71 MiB
load_tensors: CUDA1 model buffer size = 84.11 MiB
...................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
|
| 157 |
+
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
|
| 158 |
+
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
|
| 159 |
+
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
|
| 160 |
+
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
|
| 161 |
+
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
|
| 162 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 163 |
+
llama_context: CUDA0 compute buffer size = 264.91 MiB
|
| 164 |
+
llama_context: CUDA1 compute buffer size = 22.39 MiB
|
| 165 |
+
llama_context: CUDA_Host compute buffer size = 18.34 MiB
|
| 166 |
+
llama_context: graph nodes = 1815
|
| 167 |
+
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
|
| 168 |
+
common_init_from_params: added <|end_of_text|> logit bias = -inf
|
| 169 |
+
common_init_from_params: added <|fim_pad|> logit bias = -inf
|
| 170 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 171 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 172 |
+
|
| 173 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 174 |
+
perplexity: tokenizing the input ..
|
| 175 |
+
perplexity: tokenization took 34.837 ms
|
| 176 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 177 |
+
perplexity: 0.48 seconds per pass - ETA 0.12 minutes
|
| 178 |
+
[1]8.6995,[2]9.9257,[3]9.4832,[4]9.8110,[5]9.9656,[6]10.0381,[7]10.1907,[8]9.8900,[9]9.9451,[10]9.9543,[11]10.1895,[12]10.2744,[13]10.3992,[14]10.3734,[15]10.2754,
|
| 179 |
+
Final estimate: PPL = 10.2754 +/- 0.23106
|
| 180 |
+
|
| 181 |
+
llama_perf_context_print: load time = 201.16 ms
|
| 182 |
+
llama_perf_context_print: prompt eval time = 4938.26 ms / 30720 tokens ( 0.16 ms per token, 6220.81 tokens per second)
|
| 183 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 184 |
+
llama_perf_context_print: total time = 5213.96 ms / 30721 tokens
|
| 185 |
+
llama_perf_context_print: graphs reused = 0
|
| 186 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 187 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21123 + ( 356 = 81 + 10 + 264) + 2627 |
|
| 188 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
|
| 189 |
+
llama_memory_breakdown_print: | - Host | 190 = 161 + 10 + 18 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt ADDED (diff too large to render; see raw diff)
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt ADDED (diff too large to render; see raw diff)
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt ADDED (diff too large to render; see raw diff)
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/llamabench.txt ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 350M MXFP4 MoE | 345.83 MiB | 340.33 M | CUDA | 35 | pp8 | 1737.70 ± 26.88 |
| granitehybrid 350M MXFP4 MoE | 345.83 MiB | 340.33 M | CUDA | 35 | tg128 | 329.03 ± 4.34 |

build: 92bb442ad (7040)
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_code.txt ADDED
@@ -0,0 +1,188 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21547 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 169 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 345.83 MiB (8.52 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 179.63 MiB
load_tensors: CUDA0 model buffer size = 81.98 MiB
load_tensors: CUDA1 model buffer size = 84.25 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 275.59 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 88.513 ms
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.47 seconds per pass - ETA 0.33 minutes
[1]4.3877,[2]3.9892,[3]2.5746,[4]2.3726,[5]2.6116,[6]2.8551,[7]2.7045,[8]2.5122,[9]2.3088,[10]2.1400,[11]2.1223,[12]2.1486,[13]2.0595,[14]2.0395,[15]2.0798,[16]2.0140,[17]1.9885,[18]2.0068,[19]1.9670,[20]1.9318,[21]1.8988,[22]1.8838,[23]1.9131,[24]1.8867,[25]1.9059,[26]1.8740,[27]1.8607,[28]1.8525,[29]1.8981,[30]1.9147,[31]1.9136,[32]1.8893,[33]1.9129,[34]1.9050,[35]1.8862,[36]1.9175,[37]1.9237,[38]1.9216,[39]1.9431,[40]1.9407,[41]1.9338,[42]1.9579,[43]1.9666,[44]1.9558,
Final estimate: PPL = 1.9558 +/- 0.01752

llama_perf_context_print: load time = 204.68 ms
llama_perf_context_print: prompt eval time = 14225.09 ms / 90112 tokens ( 0.16 ms per token, 6334.72 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 14966.36 ms / 90113 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21112 + ( 367 = 81 + 10 + 275) + 2626 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
llama_memory_breakdown_print: | - Host | 208 = 179 + 10 + 18 |
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_general.txt ADDED
@@ -0,0 +1,188 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21547 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 350M
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 233 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 169 tensors
|
| 61 |
+
print_info: file format = GGUF V3 (latest)
|
| 62 |
+
print_info: file type = MXFP4 MoE
|
| 63 |
+
print_info: file size = 345.83 MiB (8.52 BPW)
|
| 64 |
+
load: printing all EOG tokens:
|
| 65 |
+
load: - 100257 ('<|end_of_text|>')
|
| 66 |
+
load: - 100261 ('<|fim_pad|>')
|
| 67 |
+
load: special tokens cache size = 96
|
| 68 |
+
load: token to piece cache size = 0.6152 MB
|
| 69 |
+
print_info: arch = granitehybrid
|
| 70 |
+
print_info: vocab_only = 0
|
| 71 |
+
print_info: n_ctx_train = 1048576
|
| 72 |
+
print_info: n_embd = 768
|
| 73 |
+
print_info: n_embd_inp = 768
|
| 74 |
+
print_info: n_layer = 32
|
| 75 |
+
print_info: n_head = 12
|
| 76 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 77 |
+
print_info: n_rot = 64
|
| 78 |
+
print_info: n_swa = 0
|
| 79 |
+
print_info: is_swa_any = 0
|
| 80 |
+
print_info: n_embd_head_k = 64
|
| 81 |
+
print_info: n_embd_head_v = 64
|
| 82 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 83 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
|
| 84 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
|
| 85 |
+
print_info: f_norm_eps = 0.0e+00
|
| 86 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 87 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 88 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 89 |
+
print_info: f_logit_scale = 3.0e+00
|
| 90 |
+
print_info: f_attn_scale = 1.6e-02
|
| 91 |
+
print_info: n_ff = 2048
|
| 92 |
+
print_info: n_expert = 0
|
| 93 |
+
print_info: n_expert_used = 0
|
| 94 |
+
print_info: n_expert_groups = 0
|
| 95 |
+
print_info: n_group_used = 0
|
| 96 |
+
print_info: causal attn = 1
|
| 97 |
+
print_info: pooling type = 0
|
| 98 |
+
print_info: rope type = 0
|
| 99 |
+
print_info: rope scaling = linear
|
| 100 |
+
print_info: freq_base_train = 10000.0
|
| 101 |
+
print_info: freq_scale_train = 1
|
| 102 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 103 |
+
print_info: rope_finetuned = unknown
|
| 104 |
+
print_info: ssm_d_conv = 4
|
| 105 |
+
print_info: ssm_d_inner = 1536
|
| 106 |
+
print_info: ssm_d_state = 128
|
| 107 |
+
print_info: ssm_dt_rank = 48
|
| 108 |
+
print_info: ssm_n_group = 1
|
| 109 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 110 |
+
print_info: model type = 350M
|
| 111 |
+
print_info: model params = 340.33 M
|
| 112 |
+
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 179.63 MiB
load_tensors: CUDA0 model buffer size = 81.98 MiB
load_tensors: CUDA1 model buffer size = 84.25 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 275.59 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 38.561 ms
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.47 seconds per pass - ETA 0.10 minutes
[1]18.6670,[2]21.7211,[3]22.3605,[4]20.2514,[5]20.2527,[6]18.0499,[7]17.6941,[8]17.6549,[9]18.1826,[10]18.1614,[11]17.9974,[12]18.1189,[13]18.1948,[14]18.2363,
Final estimate: PPL = 18.2363 +/- 0.46935

llama_perf_context_print: load time = 206.56 ms
llama_perf_context_print: prompt eval time = 4636.22 ms / 28672 tokens ( 0.16 ms per token, 6184.35 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 4891.79 ms / 28673 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21113 + ( 367 = 81 + 10 + 275) + 2625 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
llama_memory_breakdown_print: | - Host | 208 = 179 + 10 + 18 |
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/perplexity_math.txt
ADDED
@@ -0,0 +1,188 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21546 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 169 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 345.83 MiB (8.52 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 179.63 MiB
load_tensors: CUDA0 model buffer size = 81.98 MiB
load_tensors: CUDA1 model buffer size = 84.25 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 275.59 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 35.476 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.48 seconds per pass - ETA 0.12 minutes
[1]8.7345,[2]9.9471,[3]9.5142,[4]9.8645,[5]10.0021,[6]10.0841,[7]10.2351,[8]9.9375,[9]9.9876,[10]9.9976,[11]10.2390,[12]10.3288,[13]10.4470,[14]10.4207,[15]10.3198,
Final estimate: PPL = 10.3198 +/- 0.23252

llama_perf_context_print: load time = 205.29 ms
llama_perf_context_print: prompt eval time = 4958.00 ms / 30720 tokens ( 0.16 ms per token, 6196.04 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 5217.02 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21114 + ( 367 = 81 + 10 + 275) + 2624 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23398 + ( 115 = 84 + 8 + 22) + 610 |
llama_memory_breakdown_print: | - Host | 208 = 179 + 10 + 18 |
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff.

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff.

Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff.
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| granitehybrid 350M MXFP4 MoE | 307.88 MiB | 340.33 M | CUDA | 35 | pp8 | 1747.89 ± 15.68 |
| granitehybrid 350M MXFP4 MoE | 307.88 MiB | 340.33 M | CUDA | 35 | tg128 | 334.09 ± 4.66 |

build: 92bb442ad (7040)
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_code.txt
ADDED
@@ -0,0 +1,190 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21552 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = granitehybrid
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 350M
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
|
| 23 |
+
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
|
| 24 |
+
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
|
| 25 |
+
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
|
| 26 |
+
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
|
| 27 |
+
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
|
| 28 |
+
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
|
| 29 |
+
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
|
| 30 |
+
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
|
| 31 |
+
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
|
| 32 |
+
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
|
| 33 |
+
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
|
| 34 |
+
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
|
| 35 |
+
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
|
| 36 |
+
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
|
| 37 |
+
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
|
| 38 |
+
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
|
| 39 |
+
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
|
| 40 |
+
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
|
| 41 |
+
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
|
| 42 |
+
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
|
| 43 |
+
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
|
| 44 |
+
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
|
| 45 |
+
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
|
| 46 |
+
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
|
| 47 |
+
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
|
| 48 |
+
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
|
| 49 |
+
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
|
| 50 |
+
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
|
| 51 |
+
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
|
| 52 |
+
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
|
| 53 |
+
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
|
| 54 |
+
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
|
| 55 |
+
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
|
| 56 |
+
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
|
| 57 |
+
llama_model_loader: - kv 46: general.quantization_version u32 = 2
|
| 58 |
+
llama_model_loader: - kv 47: general.file_type u32 = 38
|
| 59 |
+
llama_model_loader: - type f32: 233 tensors
|
| 60 |
+
llama_model_loader: - type q8_0: 164 tensors
|
| 61 |
+
llama_model_loader: - type q4_K: 1 tensors
|
| 62 |
+
llama_model_loader: - type mxfp4: 4 tensors
|
| 63 |
+
print_info: file format = GGUF V3 (latest)
|
| 64 |
+
print_info: file type = MXFP4 MoE
|
| 65 |
+
print_info: file size = 307.88 MiB (7.59 BPW)
|
| 66 |
+
load: printing all EOG tokens:
|
| 67 |
+
load: - 100257 ('<|end_of_text|>')
|
| 68 |
+
load: - 100261 ('<|fim_pad|>')
|
| 69 |
+
load: special tokens cache size = 96
|
| 70 |
+
load: token to piece cache size = 0.6152 MB
|
| 71 |
+
print_info: arch = granitehybrid
|
| 72 |
+
print_info: vocab_only = 0
|
| 73 |
+
print_info: n_ctx_train = 1048576
|
| 74 |
+
print_info: n_embd = 768
|
| 75 |
+
print_info: n_embd_inp = 768
|
| 76 |
+
print_info: n_layer = 32
|
| 77 |
+
print_info: n_head = 12
|
| 78 |
+
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
|
| 79 |
+
print_info: n_rot = 64
|
| 80 |
+
print_info: n_swa = 0
|
| 81 |
+
print_info: is_swa_any = 0
|
| 82 |
+
print_info: n_embd_head_k = 64
|
| 83 |
+
print_info: n_embd_head_v = 64
|
| 84 |
+
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
|
| 85 |
+
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
|
| 86 |
+
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
|
| 87 |
+
print_info: f_norm_eps = 0.0e+00
|
| 88 |
+
print_info: f_norm_rms_eps = 1.0e-05
|
| 89 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 90 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 91 |
+
print_info: f_logit_scale = 3.0e+00
|
| 92 |
+
print_info: f_attn_scale = 1.6e-02
|
| 93 |
+
print_info: n_ff = 2048
|
| 94 |
+
print_info: n_expert = 0
|
| 95 |
+
print_info: n_expert_used = 0
|
| 96 |
+
print_info: n_expert_groups = 0
|
| 97 |
+
print_info: n_group_used = 0
|
| 98 |
+
print_info: causal attn = 1
|
| 99 |
+
print_info: pooling type = 0
|
| 100 |
+
print_info: rope type = 0
|
| 101 |
+
print_info: rope scaling = linear
|
| 102 |
+
print_info: freq_base_train = 10000.0
|
| 103 |
+
print_info: freq_scale_train = 1
|
| 104 |
+
print_info: n_ctx_orig_yarn = 1048576
|
| 105 |
+
print_info: rope_finetuned = unknown
|
| 106 |
+
print_info: ssm_d_conv = 4
|
| 107 |
+
print_info: ssm_d_inner = 1536
|
| 108 |
+
print_info: ssm_d_state = 128
|
| 109 |
+
print_info: ssm_dt_rank = 48
|
| 110 |
+
print_info: ssm_n_group = 1
|
| 111 |
+
print_info: ssm_dt_b_c_rms = 0
|
| 112 |
+
print_info: model type = 350M
|
| 113 |
+
print_info: model params = 340.33 M
|
| 114 |
+
print_info: general.name = Granite 4.0 H 350m Unsloth
|
| 115 |
+
print_info: f_embedding_scale = 12.000000
|
| 116 |
+
print_info: f_residual_scale = 0.246000
|
| 117 |
+
print_info: f_attention_scale = 0.015625
|
| 118 |
+
print_info: n_ff_shexp = 2048
|
| 119 |
+
print_info: vocab type = BPE
|
| 120 |
+
print_info: n_vocab = 100352
|
| 121 |
+
print_info: n_merges = 100000
|
| 122 |
+
print_info: BOS token = 100257 '<|end_of_text|>'
|
| 123 |
+
print_info: EOS token = 100257 '<|end_of_text|>'
|
| 124 |
+
print_info: EOT token = 100257 '<|end_of_text|>'
|
| 125 |
+
print_info: UNK token = 100269 '<|unk|>'
|
| 126 |
+
print_info: PAD token = 100256 '<|pad|>'
|
| 127 |
+
print_info: LF token = 198 'Ċ'
|
| 128 |
+
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
|
| 129 |
+
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
|
| 130 |
+
print_info: FIM MID token = 100259 '<|fim_middle|>'
|
| 131 |
+
print_info: FIM PAD token = 100261 '<|fim_pad|>'
|
| 132 |
+
print_info: EOG token = 100257 '<|end_of_text|>'
|
| 133 |
+
print_info: EOG token = 100261 '<|fim_pad|>'
|
| 134 |
+
print_info: max token length = 256
|
| 135 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 136 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 137 |
+
load_tensors: offloaded 20/33 layers to GPU
|
| 138 |
+
load_tensors: CPU_Mapped model buffer size = 142.58 MiB
|
| 139 |
+
load_tensors: CUDA0 model buffer size = 81.38 MiB
|
| 140 |
+
load_tensors: CUDA1 model buffer size = 83.95 MiB
|
| 141 |
+
........................................................................................
|
| 142 |
+
llama_context: constructing llama_context
|
| 143 |
+
llama_context: n_seq_max = 1
|
| 144 |
+
llama_context: n_ctx = 2048
|
| 145 |
+
llama_context: n_ctx_seq = 2048
|
| 146 |
+
llama_context: n_batch = 2048
|
| 147 |
+
llama_context: n_ubatch = 512
|
| 148 |
+
llama_context: causal_attn = 1
|
| 149 |
+
llama_context: flash_attn = auto
|
| 150 |
+
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 245.96 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 88.635 ms
perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.47 seconds per pass - ETA 0.33 minutes
[1]4.4986,[2]4.0911,[3]2.6221,[4]2.4166,[5]2.6557,[6]2.9173,[7]2.7668,[8]2.5706,[9]2.3568,[10]2.1804,[11]2.1612,[12]2.1893,[13]2.0975,[14]2.0757,[15]2.1161,[16]2.0473,[17]2.0205,[18]2.0398,[19]1.9985,[20]1.9621,[21]1.9284,[22]1.9130,[23]1.9432,[24]1.9166,[25]1.9366,[26]1.9038,[27]1.8914,[28]1.8828,[29]1.9304,[30]1.9485,[31]1.9471,[32]1.9223,[33]1.9459,[34]1.9381,[35]1.9189,[36]1.9530,[37]1.9586,[38]1.9571,[39]1.9794,[40]1.9778,[41]1.9709,[42]1.9966,[43]2.0058,[44]1.9946,
Final estimate: PPL = 1.9946 +/- 0.01813

llama_perf_context_print: load time = 203.27 ms
llama_perf_context_print: prompt eval time = 13828.33 ms / 90112 tokens ( 0.15 ms per token, 6516.48 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 14568.59 ms / 90113 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21148 + ( 337 = 81 + 10 + 245) + 2620 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23400 + ( 115 = 83 + 8 + 22) + 608 |
llama_memory_breakdown_print: | - Host | 171 = 142 + 10 + 18 |
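The bracketed running values and the `Final estimate: PPL = ... +/- ...` line above are an exponentiated running mean of negative log-likelihood. A minimal sketch of that aggregation (illustrative only, not llama.cpp's actual implementation; the function name and per-chunk inputs are assumptions):

```python
import math

def running_perplexity(chunk_nlls):
    """Given the mean negative log-likelihood of each evaluated chunk,
    return the running perplexity after each chunk:
    exp(mean NLL over all chunks seen so far)."""
    out, total = [], 0.0
    for i, nll in enumerate(chunk_nlls, start=1):
        total += nll
        out.append(math.exp(total / i))
    return out

# A uniform NLL of ln(4) per chunk yields a flat running PPL of 4.0.
print(running_perplexity([math.log(4)] * 3))
```

The `+/-` term reported by the tool is a standard-error estimate on that mean, so lower-variance corpora (like the code corpus here) produce tighter bounds.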
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_general.txt
ADDED
@@ -0,0 +1,190 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21548 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type mxfp4: 4 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 307.88 MiB (7.59 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 142.58 MiB
load_tensors: CUDA0 model buffer size = 81.38 MiB
load_tensors: CUDA1 model buffer size = 83.95 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 245.96 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 36.744 ms
perplexity: calculating perplexity over 14 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.48 seconds per pass - ETA 0.10 minutes
[1]19.6960,[2]22.9421,[3]23.7236,[4]21.3841,[5]21.4233,[6]19.0321,[7]18.6530,[8]18.6117,[9]19.2089,[10]19.1436,[11]18.9622,[12]19.0665,[13]19.1190,[14]19.1528,
Final estimate: PPL = 19.1528 +/- 0.49491

llama_perf_context_print: load time = 201.01 ms
llama_perf_context_print: prompt eval time = 4538.92 ms / 28672 tokens ( 0.16 ms per token, 6316.92 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 4783.14 ms / 28673 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21148 + ( 337 = 81 + 10 + 245) + 2620 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23400 + ( 115 = 83 + 8 + 22) + 608 |
llama_memory_breakdown_print: | - Host | 171 = 142 + 10 + 18 |
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/perplexity_math.txt
ADDED
@@ -0,0 +1,190 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21552 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
llama_model_loader: loaded meta data with 48 key-value pairs and 402 tensors from /mnt/world8/AI/Models/granite-4.0-h-350m-unsloth/GGUF/MXFP4/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitehybrid
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Granite 4.0 H 350m Unsloth
llama_model_loader: - kv 3: general.finetune str = unsloth
llama_model_loader: - kv 4: general.basename str = granite-4.0-h
llama_model_loader: - kv 5: general.size_label str = 350M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Granite 4.0 H 350m
llama_model_loader: - kv 9: general.base_model.0.organization str = Ibm Granite
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ibm-granite/gr...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["language", "unsloth", "granite-4.0"]
llama_model_loader: - kv 12: granitehybrid.block_count u32 = 32
llama_model_loader: - kv 13: granitehybrid.context_length u32 = 1048576
llama_model_loader: - kv 14: granitehybrid.embedding_length u32 = 768
llama_model_loader: - kv 15: granitehybrid.feed_forward_length u32 = 2048
llama_model_loader: - kv 16: granitehybrid.attention.head_count u32 = 12
llama_model_loader: - kv 17: granitehybrid.attention.head_count_kv arr[i32,32] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...
llama_model_loader: - kv 18: granitehybrid.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 19: granitehybrid.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: granitehybrid.expert_count u32 = 0
llama_model_loader: - kv 21: granitehybrid.expert_used_count u32 = 0
llama_model_loader: - kv 22: granitehybrid.vocab_size u32 = 100352
llama_model_loader: - kv 23: granitehybrid.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: granitehybrid.attention.scale f32 = 0.015625
llama_model_loader: - kv 25: granitehybrid.embedding_scale f32 = 12.000000
llama_model_loader: - kv 26: granitehybrid.residual_scale f32 = 0.246000
llama_model_loader: - kv 27: granitehybrid.logit_scale f32 = 3.000000
llama_model_loader: - kv 28: granitehybrid.expert_shared_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: granitehybrid.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 30: granitehybrid.ssm.state_size u32 = 128
llama_model_loader: - kv 31: granitehybrid.ssm.group_count u32 = 1
llama_model_loader: - kv 32: granitehybrid.ssm.inner_size u32 = 1536
llama_model_loader: - kv 33: granitehybrid.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 34: granitehybrid.rope.scaling.finetuned bool = false
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 41: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 42: tokenizer.ggml.unknown_token_id u32 = 100269
llama_model_loader: - kv 43: tokenizer.ggml.padding_token_id u32 = 100256
llama_model_loader: - kv 44: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 45: tokenizer.chat_template str = {%- set tools_system_message_prefix =...
llama_model_loader: - kv 46: general.quantization_version u32 = 2
llama_model_loader: - kv 47: general.file_type u32 = 38
llama_model_loader: - type f32: 233 tensors
llama_model_loader: - type q8_0: 164 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type mxfp4: 4 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 307.88 MiB (7.59 BPW)
load: printing all EOG tokens:
load: - 100257 ('<|end_of_text|>')
load: - 100261 ('<|fim_pad|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch = granitehybrid
print_info: vocab_only = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 768
print_info: n_embd_inp = 768
print_info: n_layer = 32
print_info: n_head = 12
print_info: n_head_kv = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0]
print_info: n_embd_k_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: n_embd_v_gqa = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 256, 0, 0, 0, 256, 0, 0, 0, 0, 0, 0, 0, 0, 0, 256, 0, 0, 0, 0]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 3.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 2048
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 1536
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 1
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 350M
print_info: model params = 340.33 M
print_info: general.name = Granite 4.0 H 350m Unsloth
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.246000
print_info: f_attention_scale = 0.015625
print_info: n_ff_shexp = 2048
print_info: vocab type = BPE
print_info: n_vocab = 100352
print_info: n_merges = 100000
print_info: BOS token = 100257 '<|end_of_text|>'
print_info: EOS token = 100257 '<|end_of_text|>'
print_info: EOT token = 100257 '<|end_of_text|>'
print_info: UNK token = 100269 '<|unk|>'
print_info: PAD token = 100256 '<|pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 100258 '<|fim_prefix|>'
print_info: FIM SUF token = 100260 '<|fim_suffix|>'
print_info: FIM MID token = 100259 '<|fim_middle|>'
print_info: FIM PAD token = 100261 '<|fim_pad|>'
print_info: EOG token = 100257 '<|end_of_text|>'
print_info: EOG token = 100261 '<|fim_pad|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 142.58 MiB
load_tensors: CUDA0 model buffer size = 81.38 MiB
load_tensors: CUDA1 model buffer size = 83.95 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.38 MiB
llama_kv_cache: CPU KV buffer size = 2.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 4.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 2.00 MiB
llama_kv_cache: size = 8.00 MiB ( 2048 cells, 4 layers, 1/1 seqs), K (f16): 4.00 MiB, V (f16): 4.00 MiB
llama_memory_recurrent: CPU RS buffer size = 8.48 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 6.16 MiB
llama_memory_recurrent: CUDA1 RS buffer size = 6.93 MiB
llama_memory_recurrent: size = 21.57 MiB ( 1 cells, 32 layers, 1 seqs), R (f32): 0.57 MiB, S (f32): 21.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 245.96 MiB
llama_context: CUDA1 compute buffer size = 22.39 MiB
llama_context: CUDA_Host compute buffer size = 18.34 MiB
llama_context: graph nodes = 1815
llama_context: graph splits = 182 (with bs=512), 41 (with bs=1)
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 36.607 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 0.47 seconds per pass - ETA 0.12 minutes
[1]9.2686,[2]10.6572,[3]10.1095,[4]10.4956,[5]10.6916,[6]10.7519,[7]10.9102,[8]10.5709,[9]10.6279,[10]10.6134,[11]10.8732,[12]10.9752,[13]11.0971,[14]11.0725,[15]10.9570,
Final estimate: PPL = 10.9570 +/- 0.25057

llama_perf_context_print: load time = 200.98 ms
llama_perf_context_print: prompt eval time = 4834.64 ms / 30720 tokens ( 0.16 ms per token, 6354.15 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 5094.04 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 21148 + ( 337 = 81 + 10 + 245) + 2620 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 23400 + ( 115 = 83 + 8 + 22) + 608 |
llama_memory_breakdown_print: | - Host | 171 = 142 + 10 + 18 |
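For reference, the `(7.59 BPW)` figure on the `print_info: file size` line is simply file bits divided by parameter count; a quick sanity check against the reported 307.88 MiB and 340.33 M parameters (the helper name is ours, not llama.cpp's):

```python
def bits_per_weight(file_size_mib: float, n_params: float) -> float:
    # BPW = (file size in bits) / (number of parameters)
    return file_size_mib * 1024 * 1024 * 8 / n_params

# 307.88 MiB over 340.33 M params reproduces the logged 7.59 BPW.
print(round(bits_per_weight(307.88, 340.33e6), 2))
```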
|
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff.
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff.
Benchmarks/granite-4.0-h-350m-unsloth-MXFP4_MOE-output_mxfp4-embd_q4_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff.