Commit fc8bd2a
1 parent: 73dfc17

Add GGUF models + tokenizer with LFS
Note: this view is limited to 50 files because the commit contains too many changes.
- .gitattributes +2 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/llamabench.txt +11 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_code.txt +151 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_general.txt +151 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_math.txt +151 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +151 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +151 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +151 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/llamabench.txt +11 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_code.txt +153 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_general.txt +153 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_math.txt +153 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_math.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/llamabench.txt +11 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_code.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_general.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_math.txt +152 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_code.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_general.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_math.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/llamabench.txt +11 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_code.txt +153 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_general.txt +153 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_math.txt +153 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_code.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_general.txt +0 -0
- Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_math.txt +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.gguf filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
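The two patterns added above route the GGUF weights and `tokenizer.json` through Git LFS. As a rough sketch of which paths they cover (using Python's `fnmatch`, which approximates but does not exactly reproduce gitattributes pattern rules — bare patterns in gitattributes match against the basename at any depth):

```python
from fnmatch import fnmatch

# The two filter=lfs patterns added to .gitattributes in this commit.
lfs_patterns = ["*.gguf", "tokenizer.json"]

def tracked_by_lfs(path: str) -> bool:
    # A bare gitattributes pattern (no slash) is matched against the
    # basename; fnmatch is a close stand-in for these simple globs.
    basename = path.rsplit("/", 1)[-1]
    return any(fnmatch(basename, p) for p in lfs_patterns)

print(tracked_by_lfs("GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf"))  # True
print(tracked_by_lfs("tokenizer.json"))                               # True
print(tracked_by_lfs("Benchmarks/llamabench.txt"))                    # False
```

The same attributes are what `git lfs track "*.gguf"` and `git lfs track tokenizer.json` would append.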
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/llamabench.txt
ADDED
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B F16 | 67.34 GiB | 36.15 B | CUDA | 35 | pp8 | 11.81 ± 0.26 |
| seed_oss 36B F16 | 67.34 GiB | 36.15 B | CUDA | 35 | tg128 | 1.55 ± 0.00 |

build: 92bb442ad (7040)
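llama-bench reports each test (`pp8`, `tg128`) as mean ± sample standard deviation over its repetitions. A minimal sketch of producing a `t/s` entry in that form — the per-repetition samples below are invented for illustration, not taken from this run:

```python
from statistics import mean, stdev

# Hypothetical per-repetition throughput samples (tokens/second).
samples = [11.55, 11.92, 12.07, 11.70, 11.81]

def format_tps(samples: list[float]) -> str:
    # llama-bench style: mean ± sample standard deviation, 2 decimals.
    return f"{mean(samples):.2f} ± {stdev(samples):.2f}"

print(format_tps(samples))  # → 11.81 ± 0.20
```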
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_code.txt
ADDED
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: general.file_type u32 = 1
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type f16: 450 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 67.34 GiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
load_tensors: CUDA0 model buffer size = 10300.86 MiB
load_tensors: CUDA1 model buffer size = 10300.86 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1828.00 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 112.946 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 16.83 seconds per pass - ETA 13.47 minutes
[1]1.5112,[2]1.4432,[3]1.2771,[4]1.2243,[5]1.1813,[6]1.2686,[7]1.3739,[8]1.4324,[9]1.4159,[10]1.3936,[11]1.3718,[12]1.3776,[13]1.3780,[14]1.3640,[15]1.3455,[16]1.3624,[17]1.3635,[18]1.3452,[19]1.3426,[20]1.3586,[21]1.3487,[22]1.3385,[23]1.3489,[24]1.3433,[25]1.3474,[26]1.3431,[27]1.3610,[28]1.3663,[29]1.3669,[30]1.3676,[31]1.3650,[32]1.3755,[33]1.3758,[34]1.3682,[35]1.3644,[36]1.3596,[37]1.3673,[38]1.3762,[39]1.3677,[40]1.3896,[41]1.3985,[42]1.4014,[43]1.4098,[44]1.4110,[45]1.4047,[46]1.4079,[47]1.4117,[48]1.4129,
Final estimate: PPL = 1.4129 +/- 0.00952

llama_perf_context_print: load time = 8471.57 ms
llama_perf_context_print: prompt eval time = 795437.69 ms / 98304 tokens ( 8.09 ms per token, 123.58 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 801965.37 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
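The bracketed sequence in the log above is a running perplexity estimate: after chunk k it is exp of the mean token-level negative log-likelihood over the first k chunks (each chunk here covers the same n_ctx=2048 tokens, so a plain mean over chunks is equivalent to a mean over tokens). A minimal sketch with invented per-chunk NLL values, not the real ones from this run:

```python
import math

# Hypothetical mean negative log-likelihood per equal-sized chunk.
chunk_nll = [0.413, 0.320, 0.255, 0.300]

running = []
total = 0.0
for k, nll in enumerate(chunk_nll, start=1):
    total += nll
    running.append(math.exp(total / k))  # PPL over the first k chunks

# Same "[k]value" style as the llama-perplexity progress line.
print(",".join(f"[{k}]{p:.4f}" for k, p in enumerate(running, 1)))
```

The last entry of the sequence is the final PPL estimate; the `+/-` term reported after it is a separately computed standard error.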
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_general.txt
ADDED
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: general.file_type u32 = 1
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type f16: 450 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 67.34 GiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
load_tensors: CUDA0 model buffer size = 10300.86 MiB
load_tensors: CUDA1 model buffer size = 10300.86 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1828.00 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 46.978 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 16.85 seconds per pass - ETA 4.20 minutes
[1]7.2108,[2]8.1347,[3]8.4667,[4]8.2219,[5]8.0076,[6]6.7314,[7]5.9343,[8]5.9926,[9]6.2640,[10]6.3232,[11]6.4603,[12]6.7925,[13]6.8088,[14]6.8826,[15]6.8905,
Final estimate: PPL = 6.8905 +/- 0.16805

llama_perf_context_print: load time = 7899.11 ms
llama_perf_context_print: prompt eval time = 248682.86 ms / 30720 tokens ( 8.10 ms per token, 123.53 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 249916.95 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_math.txt
ADDED
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: general.file_type u32 = 1
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type f16: 450 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 67.34 GiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 50 |
+
load: printing all EOG tokens:
|
| 51 |
+
load: - 2 ('<seed:eos>')
|
| 52 |
+
load: special tokens cache size = 128
|
| 53 |
+
load: token to piece cache size = 0.9296 MB
|
| 54 |
+
print_info: arch = seed_oss
|
| 55 |
+
print_info: vocab_only = 0
|
| 56 |
+
print_info: n_ctx_train = 524288
|
| 57 |
+
print_info: n_embd = 5120
|
| 58 |
+
print_info: n_embd_inp = 5120
|
| 59 |
+
print_info: n_layer = 64
|
| 60 |
+
print_info: n_head = 80
|
| 61 |
+
print_info: n_head_kv = 8
|
| 62 |
+
print_info: n_rot = 128
|
| 63 |
+
print_info: n_swa = 0
|
| 64 |
+
print_info: is_swa_any = 0
|
| 65 |
+
print_info: n_embd_head_k = 128
|
| 66 |
+
print_info: n_embd_head_v = 128
|
| 67 |
+
print_info: n_gqa = 10
|
| 68 |
+
print_info: n_embd_k_gqa = 1024
|
| 69 |
+
print_info: n_embd_v_gqa = 1024
|
| 70 |
+
print_info: f_norm_eps = 0.0e+00
|
| 71 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 72 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 73 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 74 |
+
print_info: f_logit_scale = 0.0e+00
|
| 75 |
+
print_info: f_attn_scale = 0.0e+00
|
| 76 |
+
print_info: n_ff = 27648
|
| 77 |
+
print_info: n_expert = 0
|
| 78 |
+
print_info: n_expert_used = 0
|
| 79 |
+
print_info: n_expert_groups = 0
|
| 80 |
+
print_info: n_group_used = 0
|
| 81 |
+
print_info: causal attn = 1
|
| 82 |
+
print_info: pooling type = 0
|
| 83 |
+
print_info: rope type = 2
|
| 84 |
+
print_info: rope scaling = linear
|
| 85 |
+
print_info: freq_base_train = 10000000.0
|
| 86 |
+
print_info: freq_scale_train = 1
|
| 87 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 88 |
+
print_info: rope_finetuned = unknown
|
| 89 |
+
print_info: model type = 36B
|
| 90 |
+
print_info: model params = 36.15 B
|
| 91 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 92 |
+
print_info: vocab type = BPE
|
| 93 |
+
print_info: n_vocab = 155136
|
| 94 |
+
print_info: n_merges = 154737
|
| 95 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 96 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 97 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 98 |
+
print_info: LF token = 326 'Ċ'
|
| 99 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 100 |
+
print_info: max token length = 1024
|
| 101 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 102 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 103 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 104 |
+
load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
|
| 105 |
+
load_tensors: CUDA0 model buffer size = 10300.86 MiB
|
| 106 |
+
load_tensors: CUDA1 model buffer size = 10300.86 MiB
|
| 107 |
+
..................................................................................................
|
| 108 |
+
llama_context: constructing llama_context
|
| 109 |
+
llama_context: n_seq_max = 1
|
| 110 |
+
llama_context: n_ctx = 2048
|
| 111 |
+
llama_context: n_ctx_seq = 2048
|
| 112 |
+
llama_context: n_batch = 2048
|
| 113 |
+
llama_context: n_ubatch = 512
|
| 114 |
+
llama_context: causal_attn = 1
|
| 115 |
+
llama_context: flash_attn = auto
|
| 116 |
+
llama_context: kv_unified = false
|
| 117 |
+
llama_context: freq_base = 10000000.0
|
| 118 |
+
llama_context: freq_scale = 1
|
| 119 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 120 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 121 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 122 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 125 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 126 |
+
llama_context: CUDA0 compute buffer size = 1828.00 MiB
|
| 127 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 128 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 129 |
+
llama_context: graph nodes = 2183
|
| 130 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 131 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 132 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 133 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 134 |
+
|
| 135 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 136 |
+
perplexity: tokenizing the input ..
|
| 137 |
+
perplexity: tokenization took 44.408 ms
|
| 138 |
+
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 139 |
+
perplexity: 16.82 seconds per pass - ETA 4.48 minutes
|
| 140 |
+
[1]2.6570,[2]2.8379,[3]3.2831,[4]3.5322,[5]4.0765,[6]4.3588,[7]4.5803,[8]4.7069,[9]4.8497,[10]5.0093,[11]5.0902,[12]5.1612,[13]5.2995,[14]5.4091,[15]5.4418,[16]5.4475,
|
| 141 |
+
Final estimate: PPL = 5.4475 +/- 0.12099
|
| 142 |
+
|
| 143 |
+
llama_perf_context_print: load time = 7845.77 ms
|
| 144 |
+
llama_perf_context_print: prompt eval time = 265165.37 ms / 32768 tokens ( 8.09 ms per token, 123.58 tokens per second)
|
| 145 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 146 |
+
llama_perf_context_print: total time = 266180.05 ms / 32769 tokens
|
| 147 |
+
llama_perf_context_print: graphs reused = 0
|
| 148 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 149 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
|
| 151 |
+
llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
|
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B MXFP4 MoE | 40.09 GiB | 36.15 B | CUDA | 35 | pp8 | 16.76 ± 1.58 |
| seed_oss 36B MXFP4 MoE | 40.09 GiB | 36.15 B | CUDA | 35 | tg128 | 2.52 ± 0.01 |

build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21080 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type f16: 66 tensors
llama_model_loader: - type q8_0: 384 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 40.09 GiB (9.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
load_tensors: CUDA0 model buffer size = 5941.48 MiB
load_tensors: CUDA1 model buffer size = 5941.48 MiB
...............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1828.00 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 111.27 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 10.94 seconds per pass - ETA 8.75 minutes
[1]1.5141,[2]1.4450,[3]1.2782,[4]1.2249,[5]1.1818,[6]1.2694,[7]1.3750,[8]1.4334,[9]1.4167,[10]1.3942,[11]1.3724,[12]1.3782,[13]1.3787,[14]1.3647,[15]1.3463,[16]1.3631,[17]1.3642,[18]1.3459,[19]1.3433,[20]1.3592,[21]1.3493,[22]1.3391,[23]1.3495,[24]1.3438,[25]1.3479,[26]1.3436,[27]1.3614,[28]1.3667,[29]1.3673,[30]1.3681,[31]1.3655,[32]1.3760,[33]1.3763,[34]1.3687,[35]1.3648,[36]1.3601,[37]1.3677,[38]1.3767,[39]1.3682,[40]1.3901,[41]1.3989,[42]1.4018,[43]1.4103,[44]1.4115,[45]1.4052,[46]1.4083,[47]1.4121,[48]1.4132,
Final estimate: PPL = 1.4132 +/- 0.00953

llama_perf_context_print: load time = 6489.01 ms
llama_perf_context_print: prompt eval time = 513416.32 ms / 98304 tokens ( 5.22 ms per token, 191.47 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 514990.09 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13021 + ( 7849 = 5941 + 80 + 1828) + 3235 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt
ADDED
@@ -0,0 +1,152 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20979 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 38
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type f16: 66 tensors
|
| 46 |
+
llama_model_loader: - type q8_0: 384 tensors
|
| 47 |
+
print_info: file format = GGUF V3 (latest)
|
| 48 |
+
print_info: file type = MXFP4 MoE
|
| 49 |
+
print_info: file size = 40.09 GiB (9.53 BPW)
|
| 50 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 51 |
+
load: printing all EOG tokens:
|
| 52 |
+
load: - 2 ('<seed:eos>')
|
| 53 |
+
load: special tokens cache size = 128
|
| 54 |
+
load: token to piece cache size = 0.9296 MB
|
| 55 |
+
print_info: arch = seed_oss
|
| 56 |
+
print_info: vocab_only = 0
|
| 57 |
+
print_info: n_ctx_train = 524288
|
| 58 |
+
print_info: n_embd = 5120
|
| 59 |
+
print_info: n_embd_inp = 5120
|
| 60 |
+
print_info: n_layer = 64
|
| 61 |
+
print_info: n_head = 80
|
| 62 |
+
print_info: n_head_kv = 8
|
| 63 |
+
print_info: n_rot = 128
|
| 64 |
+
print_info: n_swa = 0
|
| 65 |
+
print_info: is_swa_any = 0
|
| 66 |
+
print_info: n_embd_head_k = 128
|
| 67 |
+
print_info: n_embd_head_v = 128
|
| 68 |
+
print_info: n_gqa = 10
|
| 69 |
+
print_info: n_embd_k_gqa = 1024
|
| 70 |
+
print_info: n_embd_v_gqa = 1024
|
| 71 |
+
print_info: f_norm_eps = 0.0e+00
|
| 72 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 73 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 74 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 75 |
+
print_info: f_logit_scale = 0.0e+00
|
| 76 |
+
print_info: f_attn_scale = 0.0e+00
|
| 77 |
+
print_info: n_ff = 27648
|
| 78 |
+
print_info: n_expert = 0
|
| 79 |
+
print_info: n_expert_used = 0
|
| 80 |
+
print_info: n_expert_groups = 0
|
| 81 |
+
print_info: n_group_used = 0
|
| 82 |
+
print_info: causal attn = 1
|
| 83 |
+
print_info: pooling type = 0
|
| 84 |
+
print_info: rope type = 2
|
| 85 |
+
print_info: rope scaling = linear
|
| 86 |
+
print_info: freq_base_train = 10000000.0
|
| 87 |
+
print_info: freq_scale_train = 1
|
| 88 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 89 |
+
print_info: rope_finetuned = unknown
|
| 90 |
+
print_info: model type = 36B
|
| 91 |
+
print_info: model params = 36.15 B
|
| 92 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 93 |
+
print_info: vocab type = BPE
|
| 94 |
+
print_info: n_vocab = 155136
|
| 95 |
+
print_info: n_merges = 154737
|
| 96 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 97 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 98 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 99 |
+
print_info: LF token = 326 'Ċ'
|
| 100 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 101 |
+
print_info: max token length = 1024
|
| 102 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 103 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 104 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 105 |
+
load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
|
| 106 |
+
load_tensors: CUDA0 model buffer size = 5941.48 MiB
|
| 107 |
+
load_tensors: CUDA1 model buffer size = 5941.48 MiB
|
| 108 |
+
...............................................................................................
|
| 109 |
+
llama_context: constructing llama_context
|
| 110 |
+
llama_context: n_seq_max = 1
|
| 111 |
+
llama_context: n_ctx = 2048
|
| 112 |
+
llama_context: n_ctx_seq = 2048
|
| 113 |
+
llama_context: n_batch = 2048
|
| 114 |
+
llama_context: n_ubatch = 512
|
| 115 |
+
llama_context: causal_attn = 1
|
| 116 |
+
llama_context: flash_attn = auto
|
| 117 |
+
llama_context: kv_unified = false
|
| 118 |
+
llama_context: freq_base = 10000000.0
|
| 119 |
+
llama_context: freq_scale = 1
|
| 120 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 121 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 122 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 125 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 126 |
+
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1828.00 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 54.239 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 11.11 seconds per pass - ETA 2.77 minutes
[1]7.1571,[2]8.0959,[3]8.4409,[4]8.2030,[5]7.9871,[6]6.7194,[7]5.9263,[8]5.9868,[9]6.2585,[10]6.3190,[11]6.4574,[12]6.7899,[13]6.8056,[14]6.8799,[15]6.8893,
Final estimate: PPL = 6.8893 +/- 0.16795

llama_perf_context_print: load time = 6028.62 ms
llama_perf_context_print: prompt eval time = 161661.00 ms / 30720 tokens ( 5.26 ms per token, 190.03 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 162156.03 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13022 + ( 7849 = 5941 + 80 + 1828) + 3235 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
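As a quick sanity check on the llama_perf_context_print numbers above, the per-token latency and throughput follow directly from the total prompt-eval time and token count. A minimal sketch, with the two input values copied from the log:

```python
# Cross-check the prompt-eval throughput reported by llama_perf_context_print:
# "prompt eval time = 161661.00 ms / 30720 tokens ( 5.26 ms per token, 190.03 tokens per second)"
prompt_eval_ms = 161661.00
n_tokens = 30720

ms_per_token = prompt_eval_ms / n_tokens              # total time divided by token count
tokens_per_second = n_tokens / (prompt_eval_ms / 1000.0)

print(round(ms_per_token, 2))       # 5.26
print(round(tokens_per_second, 2))  # 190.03
```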
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21079 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type f16: 66 tensors
llama_model_loader: - type q8_0: 384 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 40.09 GiB (9.53 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
load_tensors: CUDA0 model buffer size = 5941.48 MiB
load_tensors: CUDA1 model buffer size = 5941.48 MiB
...............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1828.00 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 46.33 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 11.19 seconds per pass - ETA 2.98 minutes
[1]2.6690,[2]2.8472,[3]3.2879,[4]3.5394,[5]4.0797,[6]4.3630,[7]4.5842,[8]4.7104,[9]4.8537,[10]5.0139,[11]5.0934,[12]5.1646,[13]5.3027,[14]5.4117,[15]5.4443,[16]5.4508,
Final estimate: PPL = 5.4508 +/- 0.12108

llama_perf_context_print: load time = 6148.63 ms
llama_perf_context_print: prompt eval time = 172451.06 ms / 32768 tokens ( 5.26 ms per token, 190.01 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 172991.07 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13019 + ( 7849 = 5941 + 80 + 1828) + 3238 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
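The "file size = 40.09 GiB (9.53 BPW)" line printed by print_info is just bits-per-weight: file size in bits divided by parameter count. A minimal sketch reproducing it from the two values in the log:

```python
# Reproduce the BPW figure from print_info:
# file size = 40.09 GiB (9.53 BPW), model params = 36.15 B
file_size_gib = 40.09
params_b = 36.15

bits = file_size_gib * 2**30 * 8   # GiB -> bytes -> bits
bpw = bits / (params_b * 1e9)      # bits per parameter
print(round(bpw, 2))               # 9.53
```

Both inputs are rounded in the log, so the result matches to the two decimals llama.cpp prints.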
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B MXFP4 MoE | 34.66 GiB | 36.15 B | CUDA | 35 | pp8 | 19.67 ± 1.97 |
| seed_oss 36B MXFP4 MoE | 34.66 GiB | 36.15 B | CUDA | 35 | tg128 | 2.98 ± 0.00 |

build: 92bb442ad (7040)
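llama-bench reports throughput; for interactive use it can be easier to read as per-token latency. A small sketch inverting the tg128 row above (2.98 t/s for this Q6_K file at ngl=35):

```python
# Convert llama-bench token-generation throughput into per-token latency.
# tg128 row above: 2.98 t/s (tokens generated per second).
tg_tokens_per_s = 2.98
ms_per_token = 1000.0 / tg_tokens_per_s
print(round(ms_per_token, 1))   # 335.6 ms per generated token
```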
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21079 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 384 tensors
llama_model_loader: - type q6_K: 66 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 34.66 GiB (8.24 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
load_tensors: CUDA0 model buffer size = 5351.64 MiB
load_tensors: CUDA1 model buffer size = 5351.64 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 934.39 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 110.481 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 9.96 seconds per pass - ETA 7.97 minutes
[1]1.5109,[2]1.4432,[3]1.2771,[4]1.2244,[5]1.1814,[6]1.2691,[7]1.3743,[8]1.4325,[9]1.4161,[10]1.3936,[11]1.3718,[12]1.3776,[13]1.3780,[14]1.3640,[15]1.3455,[16]1.3622,[17]1.3632,[18]1.3450,[19]1.3424,[20]1.3585,[21]1.3486,[22]1.3385,[23]1.3490,[24]1.3433,[25]1.3475,[26]1.3433,[27]1.3613,[28]1.3667,[29]1.3673,[30]1.3680,[31]1.3653,[32]1.3758,[33]1.3761,[34]1.3686,[35]1.3647,[36]1.3599,[37]1.3675,[38]1.3764,[39]1.3679,[40]1.3897,[41]1.3985,[42]1.4013,[43]1.4098,[44]1.4111,[45]1.4048,[46]1.4080,[47]1.4117,[48]1.4128,
Final estimate: PPL = 1.4128 +/- 0.00951

llama_perf_context_print: load time = 4676.55 ms
llama_perf_context_print: prompt eval time = 468104.34 ms / 98304 tokens ( 4.76 ms per token, 210.00 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 469801.52 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14518 + ( 6366 = 5351 + 80 + 934) + 3222 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
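The llama_kv_cache line in these logs ("size = 512.00 MiB ( 2048 cells, 64 layers, ...), K (f16): 256.00 MiB, V (f16): 256.00 MiB") can be reproduced from the hyperparameters print_info shows: n_embd_k_gqa = n_embd_v_gqa = 1024, with f16 cells at 2 bytes each. A minimal sketch:

```python
# Reproduce the KV-cache size from the hyperparameters in print_info:
# n_ctx (cells) = 2048, n_layer = 64, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 = 2 bytes
n_ctx, n_layer, n_embd_kv, f16_bytes = 2048, 64, 1024, 2

k_mib = n_ctx * n_layer * n_embd_kv * f16_bytes / 2**20
v_mib = k_mib                        # V cache has the same shape as K here
print(k_mib, v_mib, k_mib + v_mib)   # 256.0 256.0 512.0
```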
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt
ADDED
@@ -0,0 +1,152 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21086 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 38
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 384 tensors
|
| 46 |
+
llama_model_loader: - type q6_K: 66 tensors
|
| 47 |
+
print_info: file format = GGUF V3 (latest)
|
| 48 |
+
print_info: file type = MXFP4 MoE
|
| 49 |
+
print_info: file size = 34.66 GiB (8.24 BPW)
|
| 50 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 51 |
+
load: printing all EOG tokens:
|
| 52 |
+
load: - 2 ('<seed:eos>')
|
| 53 |
+
load: special tokens cache size = 128
|
| 54 |
+
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
load_tensors: CUDA0 model buffer size = 5351.64 MiB
load_tensors: CUDA1 model buffer size = 5351.64 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 934.39 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 56.144 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 10.01 seconds per pass - ETA 2.50 minutes
[1]7.1304,[2]8.0728,[3]8.4291,[4]8.2015,[5]7.9879,[6]6.7206,[7]5.9265,[8]5.9860,[9]6.2591,[10]6.3192,[11]6.4574,[12]6.7926,[13]6.8101,[14]6.8853,[15]6.8946,
Final estimate: PPL = 6.8946 +/- 0.16823

llama_perf_context_print: load time = 6455.75 ms
llama_perf_context_print: prompt eval time = 146590.98 ms / 30720 tokens ( 4.77 ms per token, 209.56 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 147068.58 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14515 + ( 6366 = 5351 + 80 + 934) + 3225 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
|
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21082 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 384 tensors
llama_model_loader: - type q6_K: 66 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 34.66 GiB (8.24 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
load_tensors: CUDA0 model buffer size = 5351.64 MiB
load_tensors: CUDA1 model buffer size = 5351.64 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 934.39 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 53.45 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 9.97 seconds per pass - ETA 2.65 minutes
[1]2.6690,[2]2.8422,[3]3.2904,[4]3.5426,[5]4.0832,[6]4.3638,[7]4.5860,[8]4.7144,[9]4.8567,[10]5.0171,[11]5.0967,[12]5.1690,[13]5.3080,[14]5.4159,[15]5.4468,[16]5.4539,
Final estimate: PPL = 5.4539 +/- 0.12129

llama_perf_context_print: load time = 4703.04 ms
llama_perf_context_print: prompt eval time = 156018.54 ms / 32768 tokens ( 4.76 ms per token, 210.03 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 156524.18 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14255 + ( 6366 = 5351 + 80 + 934) + 3485 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
|
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render.
See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render.
See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render.
See raw diff
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B MXFP4 MoE | 35.78 GiB | 36.15 B | CUDA | 35 | pp8 | 19.10 ± 1.81 |
| seed_oss 36B MXFP4 MoE | 35.78 GiB | 36.15 B | CUDA | 35 | tg128 | 2.90 ± 0.00 |

build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt
ADDED
@@ -0,0 +1,151 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20819 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 450 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 35.78 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
load_tensors: CUDA0 model buffer size = 5472.73 MiB
load_tensors: CUDA1 model buffer size = 5472.73 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1117.84 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 110.03 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 10.13 seconds per pass - ETA 8.10 minutes
[1]1.5121,[2]1.4433,[3]1.2772,[4]1.2245,[5]1.1815,[6]1.2693,[7]1.3747,[8]1.4329,[9]1.4162,[10]1.3938,[11]1.3719,[12]1.3778,[13]1.3781,[14]1.3642,[15]1.3456,[16]1.3626,[17]1.3636,[18]1.3453,[19]1.3427,[20]1.3587,[21]1.3488,[22]1.3387,[23]1.3492,[24]1.3435,[25]1.3477,[26]1.3434,[27]1.3614,[28]1.3666,[29]1.3672,[30]1.3680,[31]1.3654,[32]1.3758,[33]1.3762,[34]1.3686,[35]1.3647,[36]1.3599,[37]1.3675,[38]1.3765,[39]1.3680,[40]1.3899,[41]1.3987,[42]1.4016,[43]1.4100,[44]1.4112,[45]1.4049,[46]1.4081,[47]1.4118,[48]1.4130,
Final estimate: PPL = 1.4130 +/- 0.00952

llama_perf_context_print: load time = 5278.59 ms
llama_perf_context_print: prompt eval time = 476416.36 ms / 98304 tokens ( 4.85 ms per token, 206.34 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 477987.89 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13949 + ( 6670 = 5472 + 80 + 1117) + 3486 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
|
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt
ADDED
@@ -0,0 +1,151 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20819 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 450 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 35.78 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
| 66 |
+
print_info: n_embd_head_v = 128
|
| 67 |
+
print_info: n_gqa = 10
|
| 68 |
+
print_info: n_embd_k_gqa = 1024
|
| 69 |
+
print_info: n_embd_v_gqa = 1024
|
| 70 |
+
print_info: f_norm_eps = 0.0e+00
|
| 71 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 72 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 73 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 74 |
+
print_info: f_logit_scale = 0.0e+00
|
| 75 |
+
print_info: f_attn_scale = 0.0e+00
|
| 76 |
+
print_info: n_ff = 27648
|
| 77 |
+
print_info: n_expert = 0
|
| 78 |
+
print_info: n_expert_used = 0
|
| 79 |
+
print_info: n_expert_groups = 0
|
| 80 |
+
print_info: n_group_used = 0
|
| 81 |
+
print_info: causal attn = 1
|
| 82 |
+
print_info: pooling type = 0
|
| 83 |
+
print_info: rope type = 2
|
| 84 |
+
print_info: rope scaling = linear
|
| 85 |
+
print_info: freq_base_train = 10000000.0
|
| 86 |
+
print_info: freq_scale_train = 1
|
| 87 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 88 |
+
print_info: rope_finetuned = unknown
|
| 89 |
+
print_info: model type = 36B
|
| 90 |
+
print_info: model params = 36.15 B
|
| 91 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 92 |
+
print_info: vocab type = BPE
|
| 93 |
+
print_info: n_vocab = 155136
|
| 94 |
+
print_info: n_merges = 154737
|
| 95 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 96 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 97 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 98 |
+
print_info: LF token = 326 'Ċ'
|
| 99 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 100 |
+
print_info: max token length = 1024
|
| 101 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 102 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 103 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 104 |
+
load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
|
| 105 |
+
load_tensors: CUDA0 model buffer size = 5472.73 MiB
|
| 106 |
+
load_tensors: CUDA1 model buffer size = 5472.73 MiB
|
| 107 |
+
..................................................................................................
|
| 108 |
+
llama_context: constructing llama_context
|
| 109 |
+
llama_context: n_seq_max = 1
|
| 110 |
+
llama_context: n_ctx = 2048
|
| 111 |
+
llama_context: n_ctx_seq = 2048
|
| 112 |
+
llama_context: n_batch = 2048
|
| 113 |
+
llama_context: n_ubatch = 512
|
| 114 |
+
llama_context: causal_attn = 1
|
| 115 |
+
llama_context: flash_attn = auto
|
| 116 |
+
llama_context: kv_unified = false
|
| 117 |
+
llama_context: freq_base = 10000000.0
|
| 118 |
+
llama_context: freq_scale = 1
|
| 119 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 120 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 121 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 122 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 123 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 124 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 125 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 126 |
+
llama_context: CUDA0 compute buffer size = 1117.84 MiB
|
| 127 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 128 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 129 |
+
llama_context: graph nodes = 2183
|
| 130 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 131 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 132 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 133 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 134 |
+
|
| 135 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 136 |
+
perplexity: tokenizing the input ..
|
| 137 |
+
perplexity: tokenization took 47.711 ms
|
| 138 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 139 |
+
perplexity: 10.15 seconds per pass - ETA 2.53 minutes
|
| 140 |
+
[1]7.1724,[2]8.1046,[3]8.4375,[4]8.2030,[5]7.9884,[6]6.7193,[7]5.9253,[8]5.9835,[9]6.2553,[10]6.3142,[11]6.4535,[12]6.7854,[13]6.8012,[14]6.8765,[15]6.8866,
|
| 141 |
+
Final estimate: PPL = 6.8866 +/- 0.16788
|
| 142 |
+
|
| 143 |
+
llama_perf_context_print: load time = 5411.54 ms
|
| 144 |
+
llama_perf_context_print: prompt eval time = 148859.23 ms / 30720 tokens ( 4.85 ms per token, 206.37 tokens per second)
|
| 145 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 146 |
+
llama_perf_context_print: total time = 149701.96 ms / 30721 tokens
|
| 147 |
+
llama_perf_context_print: graphs reused = 0
|
| 148 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 149 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13951 + ( 6670 = 5472 + 80 + 1117) + 3484 |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
|
| 151 |
+
llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
|
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt
ADDED
@@ -0,0 +1,151 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20817 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 450 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 35.78 GiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
load_tensors: CUDA0 model buffer size = 5472.73 MiB
load_tensors: CUDA1 model buffer size = 5472.73 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 1117.84 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 45.348 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 10.14 seconds per pass - ETA 2.70 minutes
[1]2.6609,[2]2.8390,[3]3.2814,[4]3.5349,[5]4.0776,[6]4.3592,[7]4.5794,[8]4.7074,[9]4.8513,[10]5.0108,[11]5.0902,[12]5.1610,[13]5.2988,[14]5.4088,[15]5.4411,[16]5.4474,
Final estimate: PPL = 5.4474 +/- 0.12099

llama_perf_context_print: load time = 4891.84 ms
llama_perf_context_print: prompt eval time = 158836.02 ms / 32768 tokens ( 4.85 ms per token, 206.30 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 159331.15 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13949 + ( 6670 = 5472 + 80 + 1117) + 3486 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render.

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render.

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render.
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B MXFP4 MoE | 33.54 GiB | 36.15 B | CUDA | 35 | pp8 | 20.55 ± 0.66 |
| seed_oss 36B MXFP4 MoE | 33.54 GiB | 36.15 B | CUDA | 35 | tg128 | 3.11 ± 0.00 |

build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_code.txt
ADDED
@@ -0,0 +1,153 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20815 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 384 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type mxfp4: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 33.54 GiB (7.97 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
load_tensors: CUDA0 model buffer size = 5207.11 MiB
load_tensors: CUDA1 model buffer size = 5207.11 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 715.42 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 110.913 ms
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 9.71 seconds per pass - ETA 7.77 minutes
[1]1.5039,[2]1.4366,[3]1.2732,[4]1.2227,[5]1.1808,[6]1.2697,[7]1.3771,[8]1.4370,[9]1.4212,[10]1.3989,[11]1.3764,[12]1.3833,[13]1.3834,[14]1.3692,[15]1.3523,[16]1.3691,[17]1.3709,[18]1.3525,[19]1.3494,[20]1.3651,[21]1.3555,[22]1.3455,[23]1.3559,[24]1.3503,[25]1.3543,[26]1.3507,[27]1.3683,[28]1.3735,[29]1.3738,[30]1.3740,[31]1.3711,[32]1.3820,[33]1.3821,[34]1.3743,[35]1.3703,[36]1.3658,[37]1.3738,[38]1.3828,[39]1.3743,[40]1.3961,[41]1.4052,[42]1.4080,[43]1.4167,[44]1.4181,[45]1.4116,[46]1.4146,[47]1.4186,[48]1.4199,
Final estimate: PPL = 1.4199 +/- 0.00955

llama_perf_context_print: load time = 4967.46 ms
llama_perf_context_print: prompt eval time = 456537.28 ms / 98304 tokens ( 4.64 ms per token, 215.33 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 458289.54 ms / 98305 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14616 + ( 6002 = 5207 + 80 + 715) + 3487 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_general.txt
ADDED
@@ -0,0 +1,153 @@
| 1 |
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
|
| 2 |
+
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
|
| 3 |
+
ggml_cuda_init: found 2 CUDA devices:
|
| 4 |
+
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 5 |
+
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
|
| 6 |
+
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
|
| 7 |
+
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20817 MiB free
|
| 8 |
+
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
|
| 9 |
+
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
|
| 10 |
+
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
|
| 11 |
+
llama_model_loader: - kv 0: general.architecture str = seed_oss
|
| 12 |
+
llama_model_loader: - kv 1: general.type str = model
|
| 13 |
+
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
|
| 14 |
+
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
|
| 15 |
+
llama_model_loader: - kv 4: general.basename str = Seed-OSS
|
| 16 |
+
llama_model_loader: - kv 5: general.size_label str = 36B
|
| 17 |
+
llama_model_loader: - kv 6: general.license str = apache-2.0
|
| 18 |
+
llama_model_loader: - kv 7: general.base_model.count u32 = 1
|
| 19 |
+
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
|
| 20 |
+
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
|
| 21 |
+
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
|
| 22 |
+
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
|
| 23 |
+
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
|
| 24 |
+
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
|
| 25 |
+
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
|
| 26 |
+
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
|
| 27 |
+
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
|
| 28 |
+
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
|
| 29 |
+
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
|
| 30 |
+
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
|
| 31 |
+
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
|
| 32 |
+
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
|
| 33 |
+
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
|
| 34 |
+
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
|
| 35 |
+
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
|
| 36 |
+
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
|
| 37 |
+
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
|
| 38 |
+
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
|
| 39 |
+
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
|
| 40 |
+
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
|
| 41 |
+
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
|
| 42 |
+
llama_model_loader: - kv 31: general.quantization_version u32 = 2
|
| 43 |
+
llama_model_loader: - kv 32: general.file_type u32 = 38
|
| 44 |
+
llama_model_loader: - type f32: 321 tensors
|
| 45 |
+
llama_model_loader: - type q8_0: 384 tensors
|
| 46 |
+
llama_model_loader: - type q6_K: 1 tensors
|
| 47 |
+
llama_model_loader: - type mxfp4: 65 tensors
|
| 48 |
+
print_info: file format = GGUF V3 (latest)
|
| 49 |
+
print_info: file type = MXFP4 MoE
|
| 50 |
+
print_info: file size = 33.54 GiB (7.97 BPW)
|
| 51 |
+
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
|
| 52 |
+
load: printing all EOG tokens:
|
| 53 |
+
load: - 2 ('<seed:eos>')
|
| 54 |
+
load: special tokens cache size = 128
|
| 55 |
+
load: token to piece cache size = 0.9296 MB
|
| 56 |
+
print_info: arch = seed_oss
|
| 57 |
+
print_info: vocab_only = 0
|
| 58 |
+
print_info: n_ctx_train = 524288
|
| 59 |
+
print_info: n_embd = 5120
|
| 60 |
+
print_info: n_embd_inp = 5120
|
| 61 |
+
print_info: n_layer = 64
|
| 62 |
+
print_info: n_head = 80
|
| 63 |
+
print_info: n_head_kv = 8
|
| 64 |
+
print_info: n_rot = 128
|
| 65 |
+
print_info: n_swa = 0
|
| 66 |
+
print_info: is_swa_any = 0
|
| 67 |
+
print_info: n_embd_head_k = 128
|
| 68 |
+
print_info: n_embd_head_v = 128
|
| 69 |
+
print_info: n_gqa = 10
|
| 70 |
+
print_info: n_embd_k_gqa = 1024
|
| 71 |
+
print_info: n_embd_v_gqa = 1024
|
| 72 |
+
print_info: f_norm_eps = 0.0e+00
|
| 73 |
+
print_info: f_norm_rms_eps = 1.0e-06
|
| 74 |
+
print_info: f_clamp_kqv = 0.0e+00
|
| 75 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 76 |
+
print_info: f_logit_scale = 0.0e+00
|
| 77 |
+
print_info: f_attn_scale = 0.0e+00
|
| 78 |
+
print_info: n_ff = 27648
|
| 79 |
+
print_info: n_expert = 0
|
| 80 |
+
print_info: n_expert_used = 0
|
| 81 |
+
print_info: n_expert_groups = 0
|
| 82 |
+
print_info: n_group_used = 0
|
| 83 |
+
print_info: causal attn = 1
|
| 84 |
+
print_info: pooling type = 0
|
| 85 |
+
print_info: rope type = 2
|
| 86 |
+
print_info: rope scaling = linear
|
| 87 |
+
print_info: freq_base_train = 10000000.0
|
| 88 |
+
print_info: freq_scale_train = 1
|
| 89 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 90 |
+
print_info: rope_finetuned = unknown
|
| 91 |
+
print_info: model type = 36B
|
| 92 |
+
print_info: model params = 36.15 B
|
| 93 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 94 |
+
print_info: vocab type = BPE
|
| 95 |
+
print_info: n_vocab = 155136
|
| 96 |
+
print_info: n_merges = 154737
|
| 97 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 98 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 99 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 100 |
+
print_info: LF token = 326 'Ċ'
|
| 101 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 102 |
+
print_info: max token length = 1024
|
| 103 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 104 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 105 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 106 |
+
load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
|
| 107 |
+
load_tensors: CUDA0 model buffer size = 5207.11 MiB
|
| 108 |
+
load_tensors: CUDA1 model buffer size = 5207.11 MiB
|
| 109 |
+
....................................................................................................
|
| 110 |
+
llama_context: constructing llama_context
|
| 111 |
+
llama_context: n_seq_max = 1
|
| 112 |
+
llama_context: n_ctx = 2048
|
| 113 |
+
llama_context: n_ctx_seq = 2048
|
| 114 |
+
llama_context: n_batch = 2048
|
| 115 |
+
llama_context: n_ubatch = 512
|
| 116 |
+
llama_context: causal_attn = 1
|
| 117 |
+
llama_context: flash_attn = auto
|
| 118 |
+
llama_context: kv_unified = false
|
| 119 |
+
llama_context: freq_base = 10000000.0
|
| 120 |
+
llama_context: freq_scale = 1
|
| 121 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 122 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 123 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 125 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 126 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 127 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 128 |
+
llama_context: CUDA0 compute buffer size = 715.42 MiB
|
| 129 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 130 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 131 |
+
llama_context: graph nodes = 2183
|
| 132 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 133 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 134 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 135 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 136 |
+
|
| 137 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 138 |
+
perplexity: tokenizing the input ..
|
| 139 |
+
perplexity: tokenization took 48.496 ms
|
| 140 |
+
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 141 |
+
perplexity: 9.74 seconds per pass - ETA 2.43 minutes
|
| 142 |
+
[1]7.1500,[2]8.2084,[3]8.6095,[4]8.2993,[5]8.0678,[6]6.7890,[7]5.9811,[8]6.0378,[9]6.3209,[10]6.3818,[11]6.5155,[12]6.8651,[13]6.8791,[14]6.9610,[15]6.9649,
|
| 143 |
+
Final estimate: PPL = 6.9649 +/- 0.16907
|
| 144 |
+
|
| 145 |
+
llama_perf_context_print: load time = 4554.25 ms
|
| 146 |
+
llama_perf_context_print: prompt eval time = 142787.36 ms / 30720 tokens ( 4.65 ms per token, 215.15 tokens per second)
|
| 147 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 148 |
+
llama_perf_context_print: total time = 143251.19 ms / 30721 tokens
|
| 149 |
+
llama_perf_context_print: graphs reused = 0
|
| 150 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14615 + ( 6002 = 5207 + 80 + 715) + 3489 |
|
| 152 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
|
| 153 |
+
llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
|
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_math.txt
ADDED
@@ -0,0 +1,153 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20816 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 384 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type mxfp4: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 33.54 GiB (7.97 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
load_tensors: CUDA0 model buffer size = 5207.11 MiB
load_tensors: CUDA1 model buffer size = 5207.11 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 715.42 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 44.607 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 9.70 seconds per pass - ETA 2.58 minutes
[1]2.6818,[2]2.8306,[3]3.2874,[4]3.5465,[5]4.1012,[6]4.3894,[7]4.6191,[8]4.7375,[9]4.8855,[10]5.0570,[11]5.1436,[12]5.2350,[13]5.3773,[14]5.4922,[15]5.5241,[16]5.5327,
Final estimate: PPL = 5.5327 +/- 0.12265

llama_perf_context_print: load time = 4565.69 ms
llama_perf_context_print: prompt eval time = 151956.96 ms / 32768 tokens ( 4.64 ms per token, 215.64 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 152450.77 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B MXFP4 MoE | 33.72 GiB | 36.15 B | CUDA | 35 | pp8 | 19.48 ± 1.28 |
| seed_oss 36B MXFP4 MoE | 33.72 GiB | 36.15 B | CUDA | 35 | tg128 | 3.09 ± 0.01 |

build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_code.txt
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 385 tensors
llama_model_loader: - type mxfp4: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 33.72 GiB (8.01 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
load_tensors: CUDA0 model buffer size = 5207.11 MiB
load_tensors: CUDA1 model buffer size = 5207.11 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 715.42 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
|
| 138 |
+
perplexity: tokenization took 110.758 ms
|
| 139 |
+
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 140 |
+
perplexity: 9.71 seconds per pass - ETA 7.77 minutes
|
| 141 |
+
[1]1.5028,[2]1.4359,[3]1.2728,[4]1.2224,[5]1.1806,[6]1.2696,[7]1.3771,[8]1.4369,[9]1.4213,[10]1.3990,[11]1.3766,[12]1.3833,[13]1.3833,[14]1.3692,[15]1.3523,[16]1.3690,[17]1.3708,[18]1.3524,[19]1.3493,[20]1.3652,[21]1.3556,[22]1.3455,[23]1.3559,[24]1.3502,[25]1.3543,[26]1.3506,[27]1.3683,[28]1.3733,[29]1.3736,[30]1.3738,[31]1.3709,[32]1.3817,[33]1.3819,[34]1.3740,[35]1.3700,[36]1.3655,[37]1.3736,[38]1.3826,[39]1.3741,[40]1.3959,[41]1.4050,[42]1.4078,[43]1.4165,[44]1.4179,[45]1.4114,[46]1.4144,[47]1.4184,[48]1.4198,
|
| 142 |
+
Final estimate: PPL = 1.4198 +/- 0.00955
|
| 143 |
+
|
| 144 |
+
llama_perf_context_print: load time = 4600.76 ms
|
| 145 |
+
llama_perf_context_print: prompt eval time = 454918.27 ms / 98304 tokens ( 4.63 ms per token, 216.09 tokens per second)
|
| 146 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 147 |
+
llama_perf_context_print: total time = 456481.77 ms / 98305 tokens
|
| 148 |
+
llama_perf_context_print: graphs reused = 0
|
| 149 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 150 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
|
| 152 |
+
llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
|
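The KV-cache figures printed above can be cross-checked from the hyperparameters in the same log: with n_layer = 64, n_ctx = 2048 cells, n_embd_k_gqa = n_embd_v_gqa = 1024, and f16 (2-byte) storage, K and V each come to 256 MiB, 512 MiB total. A minimal arithmetic sketch (plain Python, not llama.cpp code):

```python
# Cross-check of the "llama_kv_cache: size = 512.00 MiB" line
# using values from the print_info / llama_context output above.
n_layer = 64          # print_info: n_layer
n_ctx = 2048          # llama_context: n_ctx (KV cells)
n_embd_kv_gqa = 1024  # print_info: n_embd_k_gqa = n_embd_v_gqa
bytes_f16 = 2         # K and V are stored as f16

k_mib = n_layer * n_ctx * n_embd_kv_gqa * bytes_f16 / 2**20
v_mib = k_mib  # V has the same shape as K
print(k_mib, v_mib, k_mib + v_mib)  # 256.0 256.0 512.0
```

This matches the split reported by the loader: 352 MiB on CPU plus 80 MiB on each GPU (for the 20 offloaded layers) sums to the same 512 MiB.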
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_general.txt
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 385 tensors
llama_model_loader: - type mxfp4: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 33.72 GiB (8.01 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
load_tensors: CUDA0 model buffer size = 5207.11 MiB
load_tensors: CUDA1 model buffer size = 5207.11 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 715.42 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 68.738 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 17.37 seconds per pass - ETA 4.33 minutes
[1]7.1376,[2]8.2054,[3]8.6103,[4]8.2935,[5]8.0625,[6]6.7858,[7]5.9786,[8]6.0365,[9]6.3174,[10]6.3803,[11]6.5135,[12]6.8626,[13]6.8769,[14]6.9592,[15]6.9638,
Final estimate: PPL = 6.9638 +/- 0.16907

llama_perf_context_print: load time = 8554.11 ms
llama_perf_context_print: prompt eval time = 150106.42 ms / 30720 tokens ( 4.89 ms per token, 204.65 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 150654.44 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
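The "8.01 BPW" in this log's file-size line is derivable from the other two numbers printed alongside it. A quick sanity check, assuming llama.cpp's GiB means 2^30 bytes:

```python
# Cross-check of "print_info: file size = 33.72 GiB (8.01 BPW)".
file_gib = 33.72   # reported GGUF file size
params = 36.15e9   # print_info: model params (36.15 B)

bpw = file_gib * 2**30 * 8 / params  # bits per weight
print(round(bpw, 2))  # 8.01
```

The ~8 BPW is consistent with the tensor mix reported by the loader: q8_0 for most weight tensors plus f32 norms, with mxfp4 on 65 tensors.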
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_math.txt
ADDED
@@ -0,0 +1,152 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 385 tensors
llama_model_loader: - type mxfp4: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 33.72 GiB (8.01 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
load_tensors: CUDA0 model buffer size = 5207.11 MiB
load_tensors: CUDA1 model buffer size = 5207.11 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 715.42 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

perplexity: tokenizing the input ..
perplexity: tokenization took 43.7 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 9.69 seconds per pass - ETA 2.58 minutes
[1]2.6870,[2]2.8316,[3]3.2916,[4]3.5489,[5]4.1042,[6]4.3929,[7]4.6231,[8]4.7409,[9]4.8891,[10]5.0611,[11]5.1465,[12]5.2375,[13]5.3797,[14]5.4944,[15]5.5263,[16]5.5341,
Final estimate: PPL = 5.5341 +/- 0.12273

llama_perf_context_print: load time = 4608.15 ms
llama_perf_context_print: prompt eval time = 151699.18 ms / 32768 tokens ( 4.63 ms per token, 216.01 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 152194.46 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14627 + ( 6002 = 5207 + 80 + 715) + 3476 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
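The throughput figures printed by llama_perf_context_print are straight token-count-over-wall-time arithmetic; checking the math-run numbers above:

```python
# Cross-check of "prompt eval time = 151699.18 ms / 32768 tokens
# ( 4.63 ms per token, 216.01 tokens per second)".
ms = 151699.18   # prompt eval time in milliseconds
tokens = 32768   # 16 chunks x 2048-token context

tok_per_s = tokens / (ms / 1000.0)
ms_per_tok = ms / tokens
print(round(tok_per_s, 2), round(ms_per_tok, 2))  # 216.01 4.63
```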
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render. See raw diff.

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render. See raw diff.

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render. See raw diff.

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/llamabench.txt
ADDED
@@ -0,0 +1,11 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| seed_oss 36B MXFP4 MoE | 31.50 GiB | 36.15 B | CUDA | 35 | pp8 | 21.61 ± 0.31 |
| seed_oss 36B MXFP4 MoE | 31.50 GiB | 36.15 B | CUDA | 35 | tg128 | 3.30 ± 0.00 |

build: 92bb442ad (7040)
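For this router_gate_emb_q6_K variant, llama-bench reports 31.50 GiB at 36.15 B params, which implies roughly 7.5 bits per weight by the same size arithmetic. A sketch (the reported size is itself rounded to two decimals, so the reconstruction only agrees to within about 0.01 BPW):

```python
# Implied bits-per-weight for the 31.50 GiB q6_K-variant GGUF.
file_gib = 31.50   # llama-bench "size" column (rounded)
params = 36.15e9   # llama-bench "params" column

bpw = file_gib * 2**30 * 8 / params
print(bpw)  # ~7.485, i.e. about 7.48 BPW
```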
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_code.txt
ADDED
@@ -0,0 +1,153 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20825 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 320 tensors
llama_model_loader: - type q6_K: 65 tensors
llama_model_loader: - type mxfp4: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 31.50 GiB (7.48 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
|
| 75 |
+
print_info: f_max_alibi_bias = 0.0e+00
|
| 76 |
+
print_info: f_logit_scale = 0.0e+00
|
| 77 |
+
print_info: f_attn_scale = 0.0e+00
|
| 78 |
+
print_info: n_ff = 27648
|
| 79 |
+
print_info: n_expert = 0
|
| 80 |
+
print_info: n_expert_used = 0
|
| 81 |
+
print_info: n_expert_groups = 0
|
| 82 |
+
print_info: n_group_used = 0
|
| 83 |
+
print_info: causal attn = 1
|
| 84 |
+
print_info: pooling type = 0
|
| 85 |
+
print_info: rope type = 2
|
| 86 |
+
print_info: rope scaling = linear
|
| 87 |
+
print_info: freq_base_train = 10000000.0
|
| 88 |
+
print_info: freq_scale_train = 1
|
| 89 |
+
print_info: n_ctx_orig_yarn = 524288
|
| 90 |
+
print_info: rope_finetuned = unknown
|
| 91 |
+
print_info: model type = 36B
|
| 92 |
+
print_info: model params = 36.15 B
|
| 93 |
+
print_info: general.name = Seed OSS 36B Instruct Unsloth
|
| 94 |
+
print_info: vocab type = BPE
|
| 95 |
+
print_info: n_vocab = 155136
|
| 96 |
+
print_info: n_merges = 154737
|
| 97 |
+
print_info: BOS token = 0 '<seed:bos>'
|
| 98 |
+
print_info: EOS token = 2 '<seed:eos>'
|
| 99 |
+
print_info: PAD token = 1 '<seed:pad>'
|
| 100 |
+
print_info: LF token = 326 'Ċ'
|
| 101 |
+
print_info: EOG token = 2 '<seed:eos>'
|
| 102 |
+
print_info: max token length = 1024
|
| 103 |
+
load_tensors: loading model tensors, this can take a while... (mmap = true)
|
| 104 |
+
load_tensors: offloading 20 repeating layers to GPU
|
| 105 |
+
load_tensors: offloaded 20/65 layers to GPU
|
| 106 |
+
load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
|
| 107 |
+
load_tensors: CUDA0 model buffer size = 4880.16 MiB
|
| 108 |
+
load_tensors: CUDA1 model buffer size = 4880.16 MiB
|
| 109 |
+
...................................................................................................
|
| 110 |
+
llama_context: constructing llama_context
|
| 111 |
+
llama_context: n_seq_max = 1
|
| 112 |
+
llama_context: n_ctx = 2048
|
| 113 |
+
llama_context: n_ctx_seq = 2048
|
| 114 |
+
llama_context: n_batch = 2048
|
| 115 |
+
llama_context: n_ubatch = 512
|
| 116 |
+
llama_context: causal_attn = 1
|
| 117 |
+
llama_context: flash_attn = auto
|
| 118 |
+
llama_context: kv_unified = false
|
| 119 |
+
llama_context: freq_base = 10000000.0
|
| 120 |
+
llama_context: freq_scale = 1
|
| 121 |
+
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
|
| 122 |
+
llama_context: CPU output buffer size = 0.59 MiB
|
| 123 |
+
llama_kv_cache: CPU KV buffer size = 352.00 MiB
|
| 124 |
+
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
|
| 125 |
+
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
|
| 126 |
+
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
|
| 127 |
+
llama_context: Flash Attention was auto, set to enabled
|
| 128 |
+
llama_context: CUDA0 compute buffer size = 715.42 MiB
|
| 129 |
+
llama_context: CUDA1 compute buffer size = 194.01 MiB
|
| 130 |
+
llama_context: CUDA_Host compute buffer size = 14.01 MiB
|
| 131 |
+
llama_context: graph nodes = 2183
|
| 132 |
+
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
|
| 133 |
+
common_init_from_params: added <seed:eos> logit bias = -inf
|
| 134 |
+
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
|
| 135 |
+
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
|
| 136 |
+
|
| 137 |
+
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
|
| 138 |
+
perplexity: tokenizing the input ..
|
| 139 |
+
perplexity: tokenization took 110.832 ms
|
| 140 |
+
perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
|
| 141 |
+
perplexity: 9.34 seconds per pass - ETA 7.47 minutes
|
| 142 |
+
[1]1.5043,[2]1.4364,[3]1.2731,[4]1.2222,[5]1.1804,[6]1.2691,[7]1.3768,[8]1.4367,[9]1.4211,[10]1.3988,[11]1.3764,[12]1.3832,[13]1.3832,[14]1.3690,[15]1.3520,[16]1.3687,[17]1.3707,[18]1.3523,[19]1.3493,[20]1.3651,[21]1.3555,[22]1.3454,[23]1.3557,[24]1.3501,[25]1.3542,[26]1.3505,[27]1.3680,[28]1.3731,[29]1.3734,[30]1.3736,[31]1.3708,[32]1.3817,[33]1.3819,[34]1.3740,[35]1.3700,[36]1.3654,[37]1.3735,[38]1.3825,[39]1.3740,[40]1.3959,[41]1.4049,[42]1.4077,[43]1.4163,[44]1.4177,[45]1.4112,[46]1.4143,[47]1.4182,[48]1.4196,
|
| 143 |
+
Final estimate: PPL = 1.4196 +/- 0.00955
|
| 144 |
+
|
| 145 |
+
llama_perf_context_print: load time = 4289.82 ms
|
| 146 |
+
llama_perf_context_print: prompt eval time = 437290.20 ms / 98304 tokens ( 4.45 ms per token, 224.80 tokens per second)
|
| 147 |
+
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
|
| 148 |
+
llama_perf_context_print: total time = 438753.19 ms / 98305 tokens
|
| 149 |
+
llama_perf_context_print: graphs reused = 0
|
| 150 |
+
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
| 151 |
+
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14946 + ( 5675 = 4880 + 80 + 715) + 3484 |
|
| 152 |
+
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
|
| 153 |
+
llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
|
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_general.txt
ADDED
@@ -0,0 +1,153 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20825 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 320 tensors
llama_model_loader: - type q6_K: 65 tensors
llama_model_loader: - type mxfp4: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 31.50 GiB (7.48 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
load_tensors: CUDA0 model buffer size = 4880.16 MiB
load_tensors: CUDA1 model buffer size = 4880.16 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 715.42 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 48.169 ms
perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 9.35 seconds per pass - ETA 2.33 minutes
[1]7.1294,[2]8.1907,[3]8.6017,[4]8.2992,[5]8.0687,[6]6.7895,[7]5.9829,[8]6.0366,[9]6.3191,[10]6.3793,[11]6.5139,[12]6.8639,[13]6.8784,[14]6.9598,[15]6.9647,
Final estimate: PPL = 6.9647 +/- 0.16904

llama_perf_context_print: load time = 4633.11 ms
llama_perf_context_print: prompt eval time = 136636.03 ms / 30720 tokens ( 4.45 ms per token, 224.83 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 137103.47 ms / 30721 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14951 + ( 5675 = 4880 + 80 + 715) + 3479 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_math.txt
ADDED
@@ -0,0 +1,153 @@
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = seed_oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
llama_model_loader: - kv 4: general.basename str = Seed-OSS
llama_model_loader: - kv 5: general.size_label str = 36B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q8_0: 320 tensors
llama_model_loader: - type q6_K: 65 tensors
llama_model_loader: - type mxfp4: 65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 31.50 GiB (7.48 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('<seed:eos>')
load: special tokens cache size = 128
load: token to piece cache size = 0.9296 MB
print_info: arch = seed_oss
print_info: vocab_only = 0
print_info: n_ctx_train = 524288
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 80
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 10
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 27648
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 524288
print_info: rope_finetuned = unknown
print_info: model type = 36B
print_info: model params = 36.15 B
print_info: general.name = Seed OSS 36B Instruct Unsloth
print_info: vocab type = BPE
print_info: n_vocab = 155136
print_info: n_merges = 154737
print_info: BOS token = 0 '<seed:bos>'
print_info: EOS token = 2 '<seed:eos>'
print_info: PAD token = 1 '<seed:pad>'
print_info: LF token = 326 'Ċ'
print_info: EOG token = 2 '<seed:eos>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloaded 20/65 layers to GPU
load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
load_tensors: CUDA0 model buffer size = 4880.16 MiB
load_tensors: CUDA1 model buffer size = 4880.16 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 2048
llama_context: n_ctx_seq = 2048
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.59 MiB
llama_kv_cache: CPU KV buffer size = 352.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 715.42 MiB
llama_context: CUDA1 compute buffer size = 194.01 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 2183
llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
common_init_from_params: added <seed:eos> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 45.232 ms
perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
perplexity: 9.33 seconds per pass - ETA 2.48 minutes
[1]2.6803,[2]2.8322,[3]3.2930,[4]3.5483,[5]4.1021,[6]4.3899,[7]4.6199,[8]4.7387,[9]4.8877,[10]5.0587,[11]5.1451,[12]5.2368,[13]5.3790,[14]5.4929,[15]5.5253,[16]5.5326,
Final estimate: PPL = 5.5326 +/- 0.12270

llama_perf_context_print: load time = 4294.54 ms
llama_perf_context_print: prompt eval time = 145679.92 ms / 32768 tokens ( 4.45 ms per token, 224.93 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 146170.11 ms / 32769 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14946 + ( 5675 = 4880 + 80 + 715) + 3484 |
llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_code.txt
ADDED
The diff for this file is too large to render.
See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_general.txt
ADDED
The diff for this file is too large to render.
See raw diff

Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_math.txt
ADDED
The diff for this file is too large to render.
See raw diff