magiccodingman committed on
Commit fc8bd2a · 1 Parent(s): 73dfc17

Add GGUF models + tokenizer with LFS

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. .gitattributes +2 -0
  2. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/llamabench.txt +11 -0
  3. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_code.txt +151 -0
  4. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_general.txt +151 -0
  5. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_math.txt +151 -0
  6. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_code.txt +0 -0
  7. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_general.txt +0 -0
  8. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_math.txt +0 -0
  9. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
  10. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt +152 -0
  11. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt +152 -0
  12. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt +152 -0
  13. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
  14. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
  15. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
  16. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
  17. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +152 -0
  18. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +152 -0
  19. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +152 -0
  20. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
  21. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
  22. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
  23. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
  24. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +151 -0
  25. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +151 -0
  26. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +151 -0
  27. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
  28. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
  29. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
  30. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/llamabench.txt +11 -0
  31. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_code.txt +153 -0
  32. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_general.txt +153 -0
  33. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_math.txt +153 -0
  34. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_code.txt +0 -0
  35. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_general.txt +0 -0
  36. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_math.txt +0 -0
  37. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/llamabench.txt +11 -0
  38. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_code.txt +152 -0
  39. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_general.txt +152 -0
  40. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_math.txt +152 -0
  41. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_code.txt +0 -0
  42. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_general.txt +0 -0
  43. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_math.txt +0 -0
  44. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/llamabench.txt +11 -0
  45. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_code.txt +153 -0
  46. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_general.txt +153 -0
  47. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_math.txt +153 -0
  48. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_code.txt +0 -0
  49. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_general.txt +0 -0
  50. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_math.txt +0 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.gguf filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
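The two added attributes are exactly what `git lfs track "*.gguf"` and `git lfs track "tokenizer.json"` write. A minimal sketch of producing the same lines by hand (run from a hypothetical repo root; `git lfs track` additionally requires git-lfs to be installed):

```shell
# Append the same LFS rules that `git lfs track` would write.
# The repo root path is assumed; >> creates .gitattributes if absent.
printf '%s\n' \
  '*.gguf filter=lfs diff=lfs merge=lfs -text' \
  'tokenizer.json filter=lfs diff=lfs merge=lfs -text' >> .gitattributes
tail -n 2 .gitattributes
```

Either way, the rules must be committed before the `.gguf` files are added, or the large blobs land in regular git history instead of LFS.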
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B F16 | 67.34 GiB | 36.15 B | CUDA | 35 | pp8 | 11.81 ± 0.26 |
+ | seed_oss 36B F16 | 67.34 GiB | 36.15 B | CUDA | 35 | tg128 | 1.55 ± 0.00 |
+
+ build: 92bb442ad (7040)
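For reproducibility, an invocation along these lines (the `llama-bench` tool from llama.cpp, build 7040 per the log; the model path is a placeholder) would produce a table like the one above — `-ngl 35` matches the ngl column, and `-p 8` / `-n 128` select the pp8 and tg128 tests:

```shell
# Sketch only: prints the assumed llama-bench command line rather than
# running it, since the model file and binary are not present here.
MODEL=Seed-OSS-36B-Instruct-unsloth-F16.gguf   # placeholder path
echo llama-bench -m "$MODEL" -ngl 35 -p 8 -n 128
```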
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_code.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: general.file_type u32 = 1
+ llama_model_loader: - kv 23: general.quantization_version u32 = 2
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = F16
+ print_info: file size = 67.34 GiB (16.00 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 112.946 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 16.83 seconds per pass - ETA 13.47 minutes
+ [1]1.5112,[2]1.4432,[3]1.2771,[4]1.2243,[5]1.1813,[6]1.2686,[7]1.3739,[8]1.4324,[9]1.4159,[10]1.3936,[11]1.3718,[12]1.3776,[13]1.3780,[14]1.3640,[15]1.3455,[16]1.3624,[17]1.3635,[18]1.3452,[19]1.3426,[20]1.3586,[21]1.3487,[22]1.3385,[23]1.3489,[24]1.3433,[25]1.3474,[26]1.3431,[27]1.3610,[28]1.3663,[29]1.3669,[30]1.3676,[31]1.3650,[32]1.3755,[33]1.3758,[34]1.3682,[35]1.3644,[36]1.3596,[37]1.3673,[38]1.3762,[39]1.3677,[40]1.3896,[41]1.3985,[42]1.4014,[43]1.4098,[44]1.4110,[45]1.4047,[46]1.4079,[47]1.4117,[48]1.4129,
+ Final estimate: PPL = 1.4129 +/- 0.00952
+
+ llama_perf_context_print: load time = 8471.57 ms
+ llama_perf_context_print: prompt eval time = 795437.69 ms / 98304 tokens ( 8.09 ms per token, 123.58 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 801965.37 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
+ llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
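A log like the above would come from llama.cpp's `llama-perplexity` tool run against the corpus file committed alongside it; a sketch of the assumed invocation (`-ngl 20` and `-c 2048` match the offload and context values in the log, paths are placeholders):

```shell
# Sketch only: prints the assumed llama-perplexity command line rather
# than running it, since the 67 GiB model is not present here.
MODEL=Seed-OSS-36B-Instruct-unsloth-F16.gguf   # placeholder path
CORPUS=ppl_corpus_code.txt                     # placeholder path
echo llama-perplexity -m "$MODEL" -f "$CORPUS" -ngl 20 -c 2048
```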
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_general.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: general.file_type u32 = 1
+ llama_model_loader: - kv 23: general.quantization_version u32 = 2
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = F16
+ print_info: file size = 67.34 GiB (16.00 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 46.978 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 16.85 seconds per pass - ETA 4.20 minutes
+ [1]7.2108,[2]8.1347,[3]8.4667,[4]8.2219,[5]8.0076,[6]6.7314,[7]5.9343,[8]5.9926,[9]6.2640,[10]6.3232,[11]6.4603,[12]6.7925,[13]6.8088,[14]6.8826,[15]6.8905,
+ Final estimate: PPL = 6.8905 +/- 0.16805
+
+ llama_perf_context_print: load time = 7899.11 ms
+ llama_perf_context_print: prompt eval time = 248682.86 ms / 30720 tokens ( 8.10 ms per token, 123.53 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 249916.95 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
+ llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_math.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: general.file_type u32 = 1
+ llama_model_loader: - kv 23: general.quantization_version u32 = 2
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = F16
+ print_info: file size = 67.34 GiB (16.00 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 44.408 ms
138
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 16.82 seconds per pass - ETA 4.48 minutes
140
+ [1]2.6570,[2]2.8379,[3]3.2831,[4]3.5322,[5]4.0765,[6]4.3588,[7]4.5803,[8]4.7069,[9]4.8497,[10]5.0093,[11]5.0902,[12]5.1612,[13]5.2995,[14]5.4091,[15]5.4418,[16]5.4475,
141
+ Final estimate: PPL = 5.4475 +/- 0.12099
142
+
143
+ llama_perf_context_print: load time = 7845.77 ms
144
+ llama_perf_context_print: prompt eval time = 265165.37 ms / 32768 tokens ( 8.09 ms per token, 123.58 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 266180.05 ms / 32769 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
151
+ llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 40.09 GiB | 36.15 B | CUDA | 35 | pp8 | 16.76 ± 1.58 |
+ | seed_oss 36B MXFP4 MoE | 40.09 GiB | 36.15 B | CUDA | 35 | tg128 | 2.52 ± 0.01 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21080 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 66 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 40.09 GiB (9.53 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
+ load_tensors: CUDA0 model buffer size = 5941.48 MiB
+ load_tensors: CUDA1 model buffer size = 5941.48 MiB
+ ...............................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 111.27 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.94 seconds per pass - ETA 8.75 minutes
+ [1]1.5141,[2]1.4450,[3]1.2782,[4]1.2249,[5]1.1818,[6]1.2694,[7]1.3750,[8]1.4334,[9]1.4167,[10]1.3942,[11]1.3724,[12]1.3782,[13]1.3787,[14]1.3647,[15]1.3463,[16]1.3631,[17]1.3642,[18]1.3459,[19]1.3433,[20]1.3592,[21]1.3493,[22]1.3391,[23]1.3495,[24]1.3438,[25]1.3479,[26]1.3436,[27]1.3614,[28]1.3667,[29]1.3673,[30]1.3681,[31]1.3655,[32]1.3760,[33]1.3763,[34]1.3687,[35]1.3648,[36]1.3601,[37]1.3677,[38]1.3767,[39]1.3682,[40]1.3901,[41]1.3989,[42]1.4018,[43]1.4103,[44]1.4115,[45]1.4052,[46]1.4083,[47]1.4121,[48]1.4132,
+ Final estimate: PPL = 1.4132 +/- 0.00953
+
+ llama_perf_context_print: load time = 6489.01 ms
+ llama_perf_context_print: prompt eval time = 513416.32 ms / 98304 tokens ( 5.22 ms per token, 191.47 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 514990.09 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13021 + ( 7849 = 5941 + 80 + 1828) + 3235 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
+ llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20979 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 66 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 40.09 GiB (9.53 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
+ load_tensors: CUDA0 model buffer size = 5941.48 MiB
+ load_tensors: CUDA1 model buffer size = 5941.48 MiB
+ ...............................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 54.239 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 11.11 seconds per pass - ETA 2.77 minutes
+ [1]7.1571,[2]8.0959,[3]8.4409,[4]8.2030,[5]7.9871,[6]6.7194,[7]5.9263,[8]5.9868,[9]6.2585,[10]6.3190,[11]6.4574,[12]6.7899,[13]6.8056,[14]6.8799,[15]6.8893,
+ Final estimate: PPL = 6.8893 +/- 0.16795
+
+ llama_perf_context_print: load time = 6028.62 ms
+ llama_perf_context_print: prompt eval time = 161661.00 ms / 30720 tokens ( 5.26 ms per token, 190.03 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 162156.03 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13022 + ( 7849 = 5941 + 80 + 1828) + 3235 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
+ llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21079 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 66 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 40.09 GiB (9.53 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
+ load_tensors: CUDA0 model buffer size = 5941.48 MiB
+ load_tensors: CUDA1 model buffer size = 5941.48 MiB
+ ...............................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 46.33 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 11.19 seconds per pass - ETA 2.98 minutes
+ [1]2.6690,[2]2.8472,[3]3.2879,[4]3.5394,[5]4.0797,[6]4.3630,[7]4.5842,[8]4.7104,[9]4.8537,[10]5.0139,[11]5.0934,[12]5.1646,[13]5.3027,[14]5.4117,[15]5.4443,[16]5.4508,
+ Final estimate: PPL = 5.4508 +/- 0.12108
+
+ llama_perf_context_print: load time = 6148.63 ms
+ llama_perf_context_print: prompt eval time = 172451.06 ms / 32768 tokens ( 5.26 ms per token, 190.01 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 172991.07 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13019 + ( 7849 = 5941 + 80 + 1828) + 3238 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
152
+ llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 34.66 GiB | 36.15 B | CUDA | 35 | pp8 | 19.67 ± 1.97 |
+ | seed_oss 36B MXFP4 MoE | 34.66 GiB | 36.15 B | CUDA | 35 | tg128 | 2.98 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21079 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 66 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 34.66 GiB (8.24 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
+ load_tensors: CUDA0 model buffer size = 5351.64 MiB
+ load_tensors: CUDA1 model buffer size = 5351.64 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.481 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.96 seconds per pass - ETA 7.97 minutes
+ [1]1.5109,[2]1.4432,[3]1.2771,[4]1.2244,[5]1.1814,[6]1.2691,[7]1.3743,[8]1.4325,[9]1.4161,[10]1.3936,[11]1.3718,[12]1.3776,[13]1.3780,[14]1.3640,[15]1.3455,[16]1.3622,[17]1.3632,[18]1.3450,[19]1.3424,[20]1.3585,[21]1.3486,[22]1.3385,[23]1.3490,[24]1.3433,[25]1.3475,[26]1.3433,[27]1.3613,[28]1.3667,[29]1.3673,[30]1.3680,[31]1.3653,[32]1.3758,[33]1.3761,[34]1.3686,[35]1.3647,[36]1.3599,[37]1.3675,[38]1.3764,[39]1.3679,[40]1.3897,[41]1.3985,[42]1.4013,[43]1.4098,[44]1.4111,[45]1.4048,[46]1.4080,[47]1.4117,[48]1.4128,
+ Final estimate: PPL = 1.4128 +/- 0.00951
+
+ llama_perf_context_print: load time = 4676.55 ms
+ llama_perf_context_print: prompt eval time = 468104.34 ms / 98304 tokens ( 4.76 ms per token, 210.00 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 469801.52 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14518 + ( 6366 = 5351 + 80 + 934) + 3222 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
+ llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21086 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 66 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 34.66 GiB (8.24 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
+ load_tensors: CUDA0 model buffer size = 5351.64 MiB
+ load_tensors: CUDA1 model buffer size = 5351.64 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 56.144 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.01 seconds per pass - ETA 2.50 minutes
+ [1]7.1304,[2]8.0728,[3]8.4291,[4]8.2015,[5]7.9879,[6]6.7206,[7]5.9265,[8]5.9860,[9]6.2591,[10]6.3192,[11]6.4574,[12]6.7926,[13]6.8101,[14]6.8853,[15]6.8946,
+ Final estimate: PPL = 6.8946 +/- 0.16823
+
+ llama_perf_context_print: load time = 6455.75 ms
+ llama_perf_context_print: prompt eval time = 146590.98 ms / 30720 tokens ( 4.77 ms per token, 209.56 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 147068.58 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14515 + ( 6366 = 5351 + 80 + 934) + 3225 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
+ llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21082 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 66 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 34.66 GiB (8.24 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
+ load_tensors: CUDA0 model buffer size = 5351.64 MiB
+ load_tensors: CUDA1 model buffer size = 5351.64 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 53.45 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.97 seconds per pass - ETA 2.65 minutes
+ [1]2.6690,[2]2.8422,[3]3.2904,[4]3.5426,[5]4.0832,[6]4.3638,[7]4.5860,[8]4.7144,[9]4.8567,[10]5.0171,[11]5.0967,[12]5.1690,[13]5.3080,[14]5.4159,[15]5.4468,[16]5.4539,
+ Final estimate: PPL = 5.4539 +/- 0.12129
+
+ llama_perf_context_print: load time = 4703.04 ms
+ llama_perf_context_print: prompt eval time = 156018.54 ms / 32768 tokens ( 4.76 ms per token, 210.03 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 156524.18 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14255 + ( 6366 = 5351 + 80 + 934) + 3485 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
+ llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 35.78 GiB | 36.15 B | CUDA | 35 | pp8 | 19.10 ± 1.81 |
+ | seed_oss 36B MXFP4 MoE | 35.78 GiB | 36.15 B | CUDA | 35 | tg128 | 2.90 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20819 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 35.78 GiB (8.50 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
+ load_tensors: CUDA0 model buffer size = 5472.73 MiB
+ load_tensors: CUDA1 model buffer size = 5472.73 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.03 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.13 seconds per pass - ETA 8.10 minutes
+ [1]1.5121,[2]1.4433,[3]1.2772,[4]1.2245,[5]1.1815,[6]1.2693,[7]1.3747,[8]1.4329,[9]1.4162,[10]1.3938,[11]1.3719,[12]1.3778,[13]1.3781,[14]1.3642,[15]1.3456,[16]1.3626,[17]1.3636,[18]1.3453,[19]1.3427,[20]1.3587,[21]1.3488,[22]1.3387,[23]1.3492,[24]1.3435,[25]1.3477,[26]1.3434,[27]1.3614,[28]1.3666,[29]1.3672,[30]1.3680,[31]1.3654,[32]1.3758,[33]1.3762,[34]1.3686,[35]1.3647,[36]1.3599,[37]1.3675,[38]1.3765,[39]1.3680,[40]1.3899,[41]1.3987,[42]1.4016,[43]1.4100,[44]1.4112,[45]1.4049,[46]1.4081,[47]1.4118,[48]1.4130,
+ Final estimate: PPL = 1.4130 +/- 0.00952
+
+ llama_perf_context_print: load time = 5278.59 ms
+ llama_perf_context_print: prompt eval time = 476416.36 ms / 98304 tokens ( 4.85 ms per token, 206.34 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 477987.89 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13949 + ( 6670 = 5472 + 80 + 1117) + 3486 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20819 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 35.78 GiB (8.50 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
+ load_tensors: CUDA0 model buffer size = 5472.73 MiB
+ load_tensors: CUDA1 model buffer size = 5472.73 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 47.711 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.15 seconds per pass - ETA 2.53 minutes
+ [1]7.1724,[2]8.1046,[3]8.4375,[4]8.2030,[5]7.9884,[6]6.7193,[7]5.9253,[8]5.9835,[9]6.2553,[10]6.3142,[11]6.4535,[12]6.7854,[13]6.8012,[14]6.8765,[15]6.8866,
+ Final estimate: PPL = 6.8866 +/- 0.16788
+
+ llama_perf_context_print: load time = 5411.54 ms
+ llama_perf_context_print: prompt eval time = 148859.23 ms / 30720 tokens ( 4.85 ms per token, 206.37 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 149701.96 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13951 + ( 6670 = 5472 + 80 + 1117) + 3484 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20817 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 35.78 GiB (8.50 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
+ load_tensors: CUDA0 model buffer size = 5472.73 MiB
+ load_tensors: CUDA1 model buffer size = 5472.73 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 45.348 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.14 seconds per pass - ETA 2.70 minutes
+ [1]2.6609,[2]2.8390,[3]3.2814,[4]3.5349,[5]4.0776,[6]4.3592,[7]4.5794,[8]4.7074,[9]4.8513,[10]5.0108,[11]5.0902,[12]5.1610,[13]5.2988,[14]5.4088,[15]5.4411,[16]5.4474,
+ Final estimate: PPL = 5.4474 +/- 0.12099
+
+ llama_perf_context_print: load time = 4891.84 ms
+ llama_perf_context_print: prompt eval time = 158836.02 ms / 32768 tokens ( 4.85 ms per token, 206.30 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 159331.15 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13949 + ( 6670 = 5472 + 80 + 1117) + 3486 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 33.54 GiB | 36.15 B | CUDA | 35 | pp8 | 20.55 ± 0.66 |
+ | seed_oss 36B MXFP4 MoE | 33.54 GiB | 36.15 B | CUDA | 35 | tg128 | 3.11 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_code.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20815 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 1 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.54 GiB (7.97 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.913 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.71 seconds per pass - ETA 7.77 minutes
+ [1]1.5039,[2]1.4366,[3]1.2732,[4]1.2227,[5]1.1808,[6]1.2697,[7]1.3771,[8]1.4370,[9]1.4212,[10]1.3989,[11]1.3764,[12]1.3833,[13]1.3834,[14]1.3692,[15]1.3523,[16]1.3691,[17]1.3709,[18]1.3525,[19]1.3494,[20]1.3651,[21]1.3555,[22]1.3455,[23]1.3559,[24]1.3503,[25]1.3543,[26]1.3507,[27]1.3683,[28]1.3735,[29]1.3738,[30]1.3740,[31]1.3711,[32]1.3820,[33]1.3821,[34]1.3743,[35]1.3703,[36]1.3658,[37]1.3738,[38]1.3828,[39]1.3743,[40]1.3961,[41]1.4052,[42]1.4080,[43]1.4167,[44]1.4181,[45]1.4116,[46]1.4146,[47]1.4186,[48]1.4199,
+ Final estimate: PPL = 1.4199 +/- 0.00955
+
+ llama_perf_context_print: load time = 4967.46 ms
+ llama_perf_context_print: prompt eval time = 456537.28 ms / 98304 tokens ( 4.64 ms per token, 215.33 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 458289.54 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14616 + ( 6002 = 5207 + 80 + 715) + 3487 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_general.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20817 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 1 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.54 GiB (7.97 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 48.496 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.74 seconds per pass - ETA 2.43 minutes
+ [1]7.1500,[2]8.2084,[3]8.6095,[4]8.2993,[5]8.0678,[6]6.7890,[7]5.9811,[8]6.0378,[9]6.3209,[10]6.3818,[11]6.5155,[12]6.8651,[13]6.8791,[14]6.9610,[15]6.9649,
+ Final estimate: PPL = 6.9649 +/- 0.16907
+
+ llama_perf_context_print: load time = 4554.25 ms
+ llama_perf_context_print: prompt eval time = 142787.36 ms / 30720 tokens ( 4.65 ms per token, 215.15 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 143251.19 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14615 + ( 6002 = 5207 + 80 + 715) + 3489 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_math.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20816 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 1 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.54 GiB (7.97 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 44.607 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.70 seconds per pass - ETA 2.58 minutes
+ [1]2.6818,[2]2.8306,[3]3.2874,[4]3.5465,[5]4.1012,[6]4.3894,[7]4.6191,[8]4.7375,[9]4.8855,[10]5.0570,[11]5.1436,[12]5.2350,[13]5.3773,[14]5.4922,[15]5.5241,[16]5.5327,
+ Final estimate: PPL = 5.5327 +/- 0.12265
+
+ llama_perf_context_print: load time = 4565.69 ms
+ llama_perf_context_print: prompt eval time = 151956.96 ms / 32768 tokens ( 4.64 ms per token, 215.64 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 152450.77 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 33.72 GiB | 36.15 B | CUDA | 35 | pp8 | 19.48 ± 1.28 |
+ | seed_oss 36B MXFP4 MoE | 33.72 GiB | 36.15 B | CUDA | 35 | tg128 | 3.09 ± 0.01 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_code.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 385 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.72 GiB (8.01 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.758 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.71 seconds per pass - ETA 7.77 minutes
+ [1]1.5028,[2]1.4359,[3]1.2728,[4]1.2224,[5]1.1806,[6]1.2696,[7]1.3771,[8]1.4369,[9]1.4213,[10]1.3990,[11]1.3766,[12]1.3833,[13]1.3833,[14]1.3692,[15]1.3523,[16]1.3690,[17]1.3708,[18]1.3524,[19]1.3493,[20]1.3652,[21]1.3556,[22]1.3455,[23]1.3559,[24]1.3502,[25]1.3543,[26]1.3506,[27]1.3683,[28]1.3733,[29]1.3736,[30]1.3738,[31]1.3709,[32]1.3817,[33]1.3819,[34]1.3740,[35]1.3700,[36]1.3655,[37]1.3736,[38]1.3826,[39]1.3741,[40]1.3959,[41]1.4050,[42]1.4078,[43]1.4165,[44]1.4179,[45]1.4114,[46]1.4144,[47]1.4184,[48]1.4198,
+ Final estimate: PPL = 1.4198 +/- 0.00955
+
+ llama_perf_context_print: load time = 4600.76 ms
+ llama_perf_context_print: prompt eval time = 454918.27 ms / 98304 tokens ( 4.63 ms per token, 216.09 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 456481.77 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_general.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 385 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.72 GiB (8.01 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 68.738 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 17.37 seconds per pass - ETA 4.33 minutes
+ [1]7.1376,[2]8.2054,[3]8.6103,[4]8.2935,[5]8.0625,[6]6.7858,[7]5.9786,[8]6.0365,[9]6.3174,[10]6.3803,[11]6.5135,[12]6.8626,[13]6.8769,[14]6.9592,[15]6.9638,
+ Final estimate: PPL = 6.9638 +/- 0.16907
+
+ llama_perf_context_print: load time = 8554.11 ms
+ llama_perf_context_print: prompt eval time = 150106.42 ms / 30720 tokens ( 4.89 ms per token, 204.65 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 150654.44 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_math.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 385 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.72 GiB (8.01 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 43.7 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.69 seconds per pass - ETA 2.58 minutes
+ [1]2.6870,[2]2.8316,[3]3.2916,[4]3.5489,[5]4.1042,[6]4.3929,[7]4.6231,[8]4.7409,[9]4.8891,[10]5.0611,[11]5.1465,[12]5.2375,[13]5.3797,[14]5.4944,[15]5.5263,[16]5.5341,
+ Final estimate: PPL = 5.5341 +/- 0.12273
+
+ llama_perf_context_print: load time = 4608.15 ms
+ llama_perf_context_print: prompt eval time = 151699.18 ms / 32768 tokens ( 4.63 ms per token, 216.01 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 152194.46 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14627 + ( 6002 = 5207 + 80 + 715) + 3476 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 31.50 GiB | 36.15 B | CUDA | 35 | pp8 | 21.61 ± 0.31 |
+ | seed_oss 36B MXFP4 MoE | 31.50 GiB | 36.15 B | CUDA | 35 | tg128 | 3.30 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_code.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20825 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 320 tensors
+ llama_model_loader: - type q6_K: 65 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 31.50 GiB (7.48 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
+ load_tensors: CUDA0 model buffer size = 4880.16 MiB
+ load_tensors: CUDA1 model buffer size = 4880.16 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.832 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.34 seconds per pass - ETA 7.47 minutes
+ [1]1.5043,[2]1.4364,[3]1.2731,[4]1.2222,[5]1.1804,[6]1.2691,[7]1.3768,[8]1.4367,[9]1.4211,[10]1.3988,[11]1.3764,[12]1.3832,[13]1.3832,[14]1.3690,[15]1.3520,[16]1.3687,[17]1.3707,[18]1.3523,[19]1.3493,[20]1.3651,[21]1.3555,[22]1.3454,[23]1.3557,[24]1.3501,[25]1.3542,[26]1.3505,[27]1.3680,[28]1.3731,[29]1.3734,[30]1.3736,[31]1.3708,[32]1.3817,[33]1.3819,[34]1.3740,[35]1.3700,[36]1.3654,[37]1.3735,[38]1.3825,[39]1.3740,[40]1.3959,[41]1.4049,[42]1.4077,[43]1.4163,[44]1.4177,[45]1.4112,[46]1.4143,[47]1.4182,[48]1.4196,
+ Final estimate: PPL = 1.4196 +/- 0.00955
+
+ llama_perf_context_print: load time = 4289.82 ms
+ llama_perf_context_print: prompt eval time = 437290.20 ms / 98304 tokens ( 4.45 ms per token, 224.80 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 438753.19 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14946 + ( 5675 = 4880 + 80 + 715) + 3484 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
+ llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_general.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20825 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 320 tensors
+ llama_model_loader: - type q6_K: 65 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 31.50 GiB (7.48 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
+ load_tensors: CUDA0 model buffer size = 4880.16 MiB
+ load_tensors: CUDA1 model buffer size = 4880.16 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 48.169 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.35 seconds per pass - ETA 2.33 minutes
+ [1]7.1294,[2]8.1907,[3]8.6017,[4]8.2992,[5]8.0687,[6]6.7895,[7]5.9829,[8]6.0366,[9]6.3191,[10]6.3793,[11]6.5139,[12]6.8639,[13]6.8784,[14]6.9598,[15]6.9647,
+ Final estimate: PPL = 6.9647 +/- 0.16904
+
+ llama_perf_context_print: load time = 4633.11 ms
+ llama_perf_context_print: prompt eval time = 136636.03 ms / 30720 tokens ( 4.45 ms per token, 224.83 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 137103.47 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14951 + ( 5675 = 4880 + 80 + 715) + 3479 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
+ llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_math.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 320 tensors
+ llama_model_loader: - type q6_K: 65 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 31.50 GiB (7.48 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
+ load_tensors: CUDA0 model buffer size = 4880.16 MiB
+ load_tensors: CUDA1 model buffer size = 4880.16 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 45.232 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.33 seconds per pass - ETA 2.48 minutes
+ [1]2.6803,[2]2.8322,[3]3.2930,[4]3.5483,[5]4.1021,[6]4.3899,[7]4.6199,[8]4.7387,[9]4.8877,[10]5.0587,[11]5.1451,[12]5.2368,[13]5.3790,[14]5.4929,[15]5.5253,[16]5.5326,
+ Final estimate: PPL = 5.5326 +/- 0.12270
+
+ llama_perf_context_print: load time = 4294.54 ms
+ llama_perf_context_print: prompt eval time = 145679.92 ms / 32768 tokens ( 4.45 ms per token, 224.93 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 146170.11 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14946 + ( 5675 = 4880 + 80 + 715) + 3484 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
+ llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff