magiccodingman committed on
Commit fc8bd2a · 1 Parent(s): 73dfc17

Add GGUF models + tokenizer with LFS

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. .gitattributes +2 -0
  2. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/llamabench.txt +11 -0
  3. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_code.txt +151 -0
  4. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_general.txt +151 -0
  5. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_math.txt +151 -0
  6. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_code.txt +0 -0
  7. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_general.txt +0 -0
  8. Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_math.txt +0 -0
  9. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt +11 -0
  10. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt +152 -0
  11. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt +152 -0
  12. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt +152 -0
  13. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt +0 -0
  14. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt +0 -0
  15. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt +0 -0
  16. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt +11 -0
  17. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt +152 -0
  18. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt +152 -0
  19. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt +152 -0
  20. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt +0 -0
  21. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt +0 -0
  22. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt +0 -0
  23. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt +11 -0
  24. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt +151 -0
  25. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt +151 -0
  26. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt +151 -0
  27. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt +0 -0
  28. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt +0 -0
  29. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt +0 -0
  30. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/llamabench.txt +11 -0
  31. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_code.txt +153 -0
  32. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_general.txt +153 -0
  33. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_math.txt +153 -0
  34. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_code.txt +0 -0
  35. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_general.txt +0 -0
  36. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_math.txt +0 -0
  37. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/llamabench.txt +11 -0
  38. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_code.txt +152 -0
  39. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_general.txt +152 -0
  40. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_math.txt +152 -0
  41. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_code.txt +0 -0
  42. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_general.txt +0 -0
  43. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_math.txt +0 -0
  44. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/llamabench.txt +11 -0
  45. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_code.txt +153 -0
  46. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_general.txt +153 -0
  47. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_math.txt +153 -0
  48. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_code.txt +0 -0
  49. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_general.txt +0 -0
  50. Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_math.txt +0 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.gguf filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
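The two added attributes are exactly what `git lfs track "*.gguf"` and `git lfs track "tokenizer.json"` write. A minimal sketch of producing the same lines by hand (run from a hypothetical repo root; `git lfs track` additionally requires git-lfs to be installed):

```shell
# Append the same LFS rules that `git lfs track` would write.
# The repo root path is assumed; >> creates .gitattributes if absent.
printf '%s\n' \
  '*.gguf filter=lfs diff=lfs merge=lfs -text' \
  'tokenizer.json filter=lfs diff=lfs merge=lfs -text' >> .gitattributes
tail -n 2 .gitattributes
```

Either way, the rules must be committed before the `.gguf` files are added, or the large blobs land in regular git history instead of LFS.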
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B F16 | 67.34 GiB | 36.15 B | CUDA | 35 | pp8 | 11.81 ± 0.26 |
+ | seed_oss 36B F16 | 67.34 GiB | 36.15 B | CUDA | 35 | tg128 | 1.55 ± 0.00 |
+
+ build: 92bb442ad (7040)
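For reproducibility, an invocation along these lines (the `llama-bench` tool from llama.cpp, build 7040 per the log; the model path is a placeholder) would produce a table like the one above — `-ngl 35` matches the ngl column, and `-p 8` / `-n 128` select the pp8 and tg128 tests:

```shell
# Sketch only: prints the assumed llama-bench command line rather than
# running it, since the model file and binary are not present here.
MODEL=Seed-OSS-36B-Instruct-unsloth-F16.gguf   # placeholder path
echo llama-bench -m "$MODEL" -ngl 35 -p 8 -n 128
```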
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_code.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: general.file_type u32 = 1
+ llama_model_loader: - kv 23: general.quantization_version u32 = 2
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = F16
+ print_info: file size = 67.34 GiB (16.00 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 112.946 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 16.83 seconds per pass - ETA 13.47 minutes
+ [1]1.5112,[2]1.4432,[3]1.2771,[4]1.2243,[5]1.1813,[6]1.2686,[7]1.3739,[8]1.4324,[9]1.4159,[10]1.3936,[11]1.3718,[12]1.3776,[13]1.3780,[14]1.3640,[15]1.3455,[16]1.3624,[17]1.3635,[18]1.3452,[19]1.3426,[20]1.3586,[21]1.3487,[22]1.3385,[23]1.3489,[24]1.3433,[25]1.3474,[26]1.3431,[27]1.3610,[28]1.3663,[29]1.3669,[30]1.3676,[31]1.3650,[32]1.3755,[33]1.3758,[34]1.3682,[35]1.3644,[36]1.3596,[37]1.3673,[38]1.3762,[39]1.3677,[40]1.3896,[41]1.3985,[42]1.4014,[43]1.4098,[44]1.4110,[45]1.4047,[46]1.4079,[47]1.4117,[48]1.4129,
+ Final estimate: PPL = 1.4129 +/- 0.00952
+
+ llama_perf_context_print: load time = 8471.57 ms
+ llama_perf_context_print: prompt eval time = 795437.69 ms / 98304 tokens ( 8.09 ms per token, 123.58 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 801965.37 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
+ llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
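A log like the above would come from llama.cpp's `llama-perplexity` tool run against the corpus file committed alongside it; a sketch of the assumed invocation (`-ngl 20` and `-c 2048` match the offload and context values in the log, paths are placeholders):

```shell
# Sketch only: prints the assumed llama-perplexity command line rather
# than running it, since the 67 GiB model is not present here.
MODEL=Seed-OSS-36B-Instruct-unsloth-F16.gguf   # placeholder path
CORPUS=ppl_corpus_code.txt                     # placeholder path
echo llama-perplexity -m "$MODEL" -f "$CORPUS" -ngl 20 -c 2048
```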
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_general.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: general.file_type u32 = 1
+ llama_model_loader: - kv 23: general.quantization_version u32 = 2
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = F16
+ print_info: file size = 67.34 GiB (16.00 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 46.978 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 16.85 seconds per pass - ETA 4.20 minutes
+ [1]7.2108,[2]8.1347,[3]8.4667,[4]8.2219,[5]8.0076,[6]6.7314,[7]5.9343,[8]5.9926,[9]6.2640,[10]6.3232,[11]6.4603,[12]6.7925,[13]6.8088,[14]6.8826,[15]6.8905,
+ Final estimate: PPL = 6.8905 +/- 0.16805
+
+ llama_perf_context_print: load time = 7899.11 ms
+ llama_perf_context_print: prompt eval time = 248682.86 ms / 30720 tokens ( 8.10 ms per token, 123.53 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 249916.95 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
+ llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/perplexity_math.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/Seed-OSS-36B-Instruct-unsloth-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: general.file_type u32 = 1
+ llama_model_loader: - kv 23: general.quantization_version u32 = 2
+ llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 25: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 32: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = F16
+ print_info: file size = 67.34 GiB (16.00 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 68955.52 MiB
+ load_tensors: CUDA0 model buffer size = 10300.86 MiB
+ load_tensors: CUDA1 model buffer size = 10300.86 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
134
+
135
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
136
+ perplexity: tokenizing the input ..
137
+ perplexity: tokenization took 44.408 ms
138
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
139
+ perplexity: 16.82 seconds per pass - ETA 4.48 minutes
140
+ [1]2.6570,[2]2.8379,[3]3.2831,[4]3.5322,[5]4.0765,[6]4.3588,[7]4.5803,[8]4.7069,[9]4.8497,[10]5.0093,[11]5.0902,[12]5.1612,[13]5.2995,[14]5.4091,[15]5.4418,[16]5.4475,
141
+ Final estimate: PPL = 5.4475 +/- 0.12099
142
+
143
+ llama_perf_context_print: load time = 7845.77 ms
144
+ llama_perf_context_print: prompt eval time = 265165.37 ms / 32768 tokens ( 8.09 ms per token, 123.58 tokens per second)
145
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
146
+ llama_perf_context_print: total time = 266180.05 ms / 32769 tokens
147
+ llama_perf_context_print: graphs reused = 0
148
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
149
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 8400 + (12208 = 10300 + 80 + 1828) + 3497 |
150
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 12273 + (10574 = 10300 + 80 + 194) + 1275 |
151
+ llama_memory_breakdown_print: | - Host | 69321 = 68955 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-F16/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 40.09 GiB | 36.15 B | CUDA | 35 | pp8 | 16.76 ± 1.58 |
+ | seed_oss 36B MXFP4 MoE | 40.09 GiB | 36.15 B | CUDA | 35 | tg128 | 2.52 ± 0.01 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_code.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21080 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 66 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 40.09 GiB (9.53 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
+ load_tensors: CUDA0 model buffer size = 5941.48 MiB
+ load_tensors: CUDA1 model buffer size = 5941.48 MiB
+ ...............................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 111.27 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.94 seconds per pass - ETA 8.75 minutes
+ [1]1.5141,[2]1.4450,[3]1.2782,[4]1.2249,[5]1.1818,[6]1.2694,[7]1.3750,[8]1.4334,[9]1.4167,[10]1.3942,[11]1.3724,[12]1.3782,[13]1.3787,[14]1.3647,[15]1.3463,[16]1.3631,[17]1.3642,[18]1.3459,[19]1.3433,[20]1.3592,[21]1.3493,[22]1.3391,[23]1.3495,[24]1.3438,[25]1.3479,[26]1.3436,[27]1.3614,[28]1.3667,[29]1.3673,[30]1.3681,[31]1.3655,[32]1.3760,[33]1.3763,[34]1.3687,[35]1.3648,[36]1.3601,[37]1.3677,[38]1.3767,[39]1.3682,[40]1.3901,[41]1.3989,[42]1.4018,[43]1.4103,[44]1.4115,[45]1.4052,[46]1.4083,[47]1.4121,[48]1.4132,
+ Final estimate: PPL = 1.4132 +/- 0.00953
+
+ llama_perf_context_print: load time = 6489.01 ms
+ llama_perf_context_print: prompt eval time = 513416.32 ms / 98304 tokens ( 5.22 ms per token, 191.47 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 514990.09 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13021 + ( 7849 = 5941 + 80 + 1828) + 3235 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
+ llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_general.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20979 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 66 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 40.09 GiB (9.53 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
+ load_tensors: CUDA0 model buffer size = 5941.48 MiB
+ load_tensors: CUDA1 model buffer size = 5941.48 MiB
+ ...............................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 54.239 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 11.11 seconds per pass - ETA 2.77 minutes
+ [1]7.1571,[2]8.0959,[3]8.4409,[4]8.2030,[5]7.9871,[6]6.7194,[7]5.9263,[8]5.9868,[9]6.2585,[10]6.3190,[11]6.4574,[12]6.7899,[13]6.8056,[14]6.8799,[15]6.8893,
+ Final estimate: PPL = 6.8893 +/- 0.16795
+
+ llama_perf_context_print: load time = 6028.62 ms
+ llama_perf_context_print: prompt eval time = 161661.00 ms / 30720 tokens ( 5.26 ms per token, 190.03 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 162156.03 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13022 + ( 7849 = 5941 + 80 + 1828) + 3235 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
+ llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/perplexity_math.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21079 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type f16: 66 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 40.09 GiB (9.53 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 29172.55 MiB
+ load_tensors: CUDA0 model buffer size = 5941.48 MiB
+ load_tensors: CUDA1 model buffer size = 5941.48 MiB
+ ...............................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1828.00 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 46.33 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 11.19 seconds per pass - ETA 2.98 minutes
+ [1]2.6690,[2]2.8472,[3]3.2879,[4]3.5394,[5]4.0797,[6]4.3630,[7]4.5842,[8]4.7104,[9]4.8537,[10]5.0139,[11]5.0934,[12]5.1646,[13]5.3027,[14]5.4117,[15]5.4443,[16]5.4508,
+ Final estimate: PPL = 5.4508 +/- 0.12108
+
+ llama_perf_context_print: load time = 6148.63 ms
+ llama_perf_context_print: prompt eval time = 172451.06 ms / 32768 tokens ( 5.26 ms per token, 190.01 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 172991.07 ms / 32769 tokens
148
+ llama_perf_context_print: graphs reused = 0
149
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
150
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13019 + ( 7849 = 5941 + 80 + 1828) + 3238 |
151
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 16633 + ( 6215 = 5941 + 80 + 194) + 1275 |
152
+ llama_memory_breakdown_print: | - Host | 29538 = 29172 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-F16/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 34.66 GiB | 36.15 B | CUDA | 35 | pp8 | 19.67 ± 1.97 |
+ | seed_oss 36B MXFP4 MoE | 34.66 GiB | 36.15 B | CUDA | 35 | tg128 | 2.98 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_code.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21079 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 66 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 34.66 GiB (8.24 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
+ load_tensors: CUDA0 model buffer size = 5351.64 MiB
+ load_tensors: CUDA1 model buffer size = 5351.64 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.481 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.96 seconds per pass - ETA 7.97 minutes
+ [1]1.5109,[2]1.4432,[3]1.2771,[4]1.2244,[5]1.1814,[6]1.2691,[7]1.3743,[8]1.4325,[9]1.4161,[10]1.3936,[11]1.3718,[12]1.3776,[13]1.3780,[14]1.3640,[15]1.3455,[16]1.3622,[17]1.3632,[18]1.3450,[19]1.3424,[20]1.3585,[21]1.3486,[22]1.3385,[23]1.3490,[24]1.3433,[25]1.3475,[26]1.3433,[27]1.3613,[28]1.3667,[29]1.3673,[30]1.3680,[31]1.3653,[32]1.3758,[33]1.3761,[34]1.3686,[35]1.3647,[36]1.3599,[37]1.3675,[38]1.3764,[39]1.3679,[40]1.3897,[41]1.3985,[42]1.4013,[43]1.4098,[44]1.4111,[45]1.4048,[46]1.4080,[47]1.4117,[48]1.4128,
+ Final estimate: PPL = 1.4128 +/- 0.00951
+
+ llama_perf_context_print: load time = 4676.55 ms
+ llama_perf_context_print: prompt eval time = 468104.34 ms / 98304 tokens ( 4.76 ms per token, 210.00 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 469801.52 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14518 + ( 6366 = 5351 + 80 + 934) + 3222 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
+ llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_general.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21086 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 66 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 34.66 GiB (8.24 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
+ load_tensors: CUDA0 model buffer size = 5351.64 MiB
+ load_tensors: CUDA1 model buffer size = 5351.64 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 56.144 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.01 seconds per pass - ETA 2.50 minutes
+ [1]7.1304,[2]8.0728,[3]8.4291,[4]8.2015,[5]7.9879,[6]6.7206,[7]5.9265,[8]5.9860,[9]6.2591,[10]6.3192,[11]6.4574,[12]6.7926,[13]6.8101,[14]6.8853,[15]6.8946,
+ Final estimate: PPL = 6.8946 +/- 0.16823
+
+ llama_perf_context_print: load time = 6455.75 ms
+ llama_perf_context_print: prompt eval time = 146590.98 ms / 30720 tokens ( 4.77 ms per token, 209.56 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 147068.58 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14515 + ( 6366 = 5351 + 80 + 934) + 3225 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
+ llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/perplexity_math.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 21082 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 66 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 34.66 GiB (8.24 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24790.01 MiB
+ load_tensors: CUDA0 model buffer size = 5351.64 MiB
+ load_tensors: CUDA1 model buffer size = 5351.64 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 934.39 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 53.45 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.97 seconds per pass - ETA 2.65 minutes
+ [1]2.6690,[2]2.8422,[3]3.2904,[4]3.5426,[5]4.0832,[6]4.3638,[7]4.5860,[8]4.7144,[9]4.8567,[10]5.0171,[11]5.0967,[12]5.1690,[13]5.3080,[14]5.4159,[15]5.4468,[16]5.4539,
+ Final estimate: PPL = 5.4539 +/- 0.12129
+
+ llama_perf_context_print: load time = 4703.04 ms
+ llama_perf_context_print: prompt eval time = 156018.54 ms / 32768 tokens ( 4.76 ms per token, 210.03 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 156524.18 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14255 + ( 6366 = 5351 + 80 + 934) + 3485 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17235 + ( 5625 = 5351 + 80 + 194) + 1262 |
+ llama_memory_breakdown_print: | - Host | 25156 = 24790 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q6_K/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 35.78 GiB | 36.15 B | CUDA | 35 | pp8 | 19.10 ± 1.81 |
+ | seed_oss 36B MXFP4 MoE | 35.78 GiB | 36.15 B | CUDA | 35 | tg128 | 2.90 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_code.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20819 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 35.78 GiB (8.50 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
+ load_tensors: CUDA0 model buffer size = 5472.73 MiB
+ load_tensors: CUDA1 model buffer size = 5472.73 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.03 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.13 seconds per pass - ETA 8.10 minutes
+ [1]1.5121,[2]1.4433,[3]1.2772,[4]1.2245,[5]1.1815,[6]1.2693,[7]1.3747,[8]1.4329,[9]1.4162,[10]1.3938,[11]1.3719,[12]1.3778,[13]1.3781,[14]1.3642,[15]1.3456,[16]1.3626,[17]1.3636,[18]1.3453,[19]1.3427,[20]1.3587,[21]1.3488,[22]1.3387,[23]1.3492,[24]1.3435,[25]1.3477,[26]1.3434,[27]1.3614,[28]1.3666,[29]1.3672,[30]1.3680,[31]1.3654,[32]1.3758,[33]1.3762,[34]1.3686,[35]1.3647,[36]1.3599,[37]1.3675,[38]1.3765,[39]1.3680,[40]1.3899,[41]1.3987,[42]1.4016,[43]1.4100,[44]1.4112,[45]1.4049,[46]1.4081,[47]1.4118,[48]1.4130,
+ Final estimate: PPL = 1.4130 +/- 0.00952
+
+ llama_perf_context_print: load time = 5278.59 ms
+ llama_perf_context_print: prompt eval time = 476416.36 ms / 98304 tokens ( 4.85 ms per token, 206.34 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 477987.89 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13949 + ( 6670 = 5472 + 80 + 1117) + 3486 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_general.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20819 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 35.78 GiB (8.50 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
+ load_tensors: CUDA0 model buffer size = 5472.73 MiB
+ load_tensors: CUDA1 model buffer size = 5472.73 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 47.711 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.15 seconds per pass - ETA 2.53 minutes
+ [1]7.1724,[2]8.1046,[3]8.4375,[4]8.2030,[5]7.9884,[6]6.7193,[7]5.9253,[8]5.9835,[9]6.2553,[10]6.3142,[11]6.4535,[12]6.7854,[13]6.8012,[14]6.8765,[15]6.8866,
+ Final estimate: PPL = 6.8866 +/- 0.16788
+
+ llama_perf_context_print: load time = 5411.54 ms
+ llama_perf_context_print: prompt eval time = 148859.23 ms / 30720 tokens ( 4.85 ms per token, 206.37 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 149701.96 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13951 + ( 6670 = 5472 + 80 + 1117) + 3484 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/perplexity_math.txt ADDED
@@ -0,0 +1,151 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20817 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 450 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 35.78 GiB (8.50 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 25689.74 MiB
+ load_tensors: CUDA0 model buffer size = 5472.73 MiB
+ load_tensors: CUDA1 model buffer size = 5472.73 MiB
+ ..................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 1117.84 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 45.348 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 10.14 seconds per pass - ETA 2.70 minutes
+ [1]2.6609,[2]2.8390,[3]3.2814,[4]3.5349,[5]4.0776,[6]4.3592,[7]4.5794,[8]4.7074,[9]4.8513,[10]5.0108,[11]5.0902,[12]5.1610,[13]5.2988,[14]5.4088,[15]5.4411,[16]5.4474,
+ Final estimate: PPL = 5.4474 +/- 0.12099
+
+ llama_perf_context_print: load time = 4891.84 ms
+ llama_perf_context_print: prompt eval time = 158836.02 ms / 32768 tokens ( 4.85 ms per token, 206.30 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 159331.15 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 13949 + ( 6670 = 5472 + 80 + 1117) + 3486 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17113 + ( 5746 = 5472 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 26055 = 25689 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-Q8/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 33.54 GiB | 36.15 B | CUDA | 35 | pp8 | 20.55 ± 0.66 |
+ | seed_oss 36B MXFP4 MoE | 33.54 GiB | 36.15 B | CUDA | 35 | tg128 | 3.11 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_code.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20815 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 1 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.54 GiB (7.97 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.913 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.71 seconds per pass - ETA 7.77 minutes
+ [1]1.5039,[2]1.4366,[3]1.2732,[4]1.2227,[5]1.1808,[6]1.2697,[7]1.3771,[8]1.4370,[9]1.4212,[10]1.3989,[11]1.3764,[12]1.3833,[13]1.3834,[14]1.3692,[15]1.3523,[16]1.3691,[17]1.3709,[18]1.3525,[19]1.3494,[20]1.3651,[21]1.3555,[22]1.3455,[23]1.3559,[24]1.3503,[25]1.3543,[26]1.3507,[27]1.3683,[28]1.3735,[29]1.3738,[30]1.3740,[31]1.3711,[32]1.3820,[33]1.3821,[34]1.3743,[35]1.3703,[36]1.3658,[37]1.3738,[38]1.3828,[39]1.3743,[40]1.3961,[41]1.4052,[42]1.4080,[43]1.4167,[44]1.4181,[45]1.4116,[46]1.4146,[47]1.4186,[48]1.4199,
+ Final estimate: PPL = 1.4199 +/- 0.00955
+
+ llama_perf_context_print: load time = 4967.46 ms
+ llama_perf_context_print: prompt eval time = 456537.28 ms / 98304 tokens ( 4.64 ms per token, 215.33 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 458289.54 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14616 + ( 6002 = 5207 + 80 + 715) + 3487 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_general.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20817 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 1 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.54 GiB (7.97 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 48.496 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.74 seconds per pass - ETA 2.43 minutes
+ [1]7.1500,[2]8.2084,[3]8.6095,[4]8.2993,[5]8.0678,[6]6.7890,[7]5.9811,[8]6.0378,[9]6.3209,[10]6.3818,[11]6.5155,[12]6.8651,[13]6.8791,[14]6.9610,[15]6.9649,
+ Final estimate: PPL = 6.9649 +/- 0.16907
+
+ llama_perf_context_print: load time = 4554.25 ms
+ llama_perf_context_print: prompt eval time = 142787.36 ms / 30720 tokens ( 4.65 ms per token, 215.15 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 143251.19 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14615 + ( 6002 = 5207 + 80 + 715) + 3489 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/perplexity_math.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20816 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 384 tensors
+ llama_model_loader: - type q6_K: 1 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.54 GiB (7.97 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 23935.11 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 44.607 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.70 seconds per pass - ETA 2.58 minutes
+ [1]2.6818,[2]2.8306,[3]3.2874,[4]3.5465,[5]4.1012,[6]4.3894,[7]4.6191,[8]4.7375,[9]4.8855,[10]5.0570,[11]5.1436,[12]5.2350,[13]5.3773,[14]5.4922,[15]5.5241,[16]5.5327,
+ Final estimate: PPL = 5.5327 +/- 0.12265
+
+ llama_perf_context_print: load time = 4565.69 ms
+ llama_perf_context_print: prompt eval time = 151956.96 ms / 32768 tokens ( 4.64 ms per token, 215.64 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 152450.77 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24301 = 23935 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q6_K/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 33.72 GiB | 36.15 B | CUDA | 35 | pp8 | 19.48 ± 1.28 |
+ | seed_oss 36B MXFP4 MoE | 33.72 GiB | 36.15 B | CUDA | 35 | tg128 | 3.09 ± 0.01 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_code.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 385 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.72 GiB (8.01 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.758 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.71 seconds per pass - ETA 7.77 minutes
+ [1]1.5028,[2]1.4359,[3]1.2728,[4]1.2224,[5]1.1806,[6]1.2696,[7]1.3771,[8]1.4369,[9]1.4213,[10]1.3990,[11]1.3766,[12]1.3833,[13]1.3833,[14]1.3692,[15]1.3523,[16]1.3690,[17]1.3708,[18]1.3524,[19]1.3493,[20]1.3652,[21]1.3556,[22]1.3455,[23]1.3559,[24]1.3502,[25]1.3543,[26]1.3506,[27]1.3683,[28]1.3733,[29]1.3736,[30]1.3738,[31]1.3709,[32]1.3817,[33]1.3819,[34]1.3740,[35]1.3700,[36]1.3655,[37]1.3736,[38]1.3826,[39]1.3741,[40]1.3959,[41]1.4050,[42]1.4078,[43]1.4165,[44]1.4179,[45]1.4114,[46]1.4144,[47]1.4184,[48]1.4198,
+ Final estimate: PPL = 1.4198 +/- 0.00955
+
+ llama_perf_context_print: load time = 4600.76 ms
+ llama_perf_context_print: prompt eval time = 454918.27 ms / 98304 tokens ( 4.63 ms per token, 216.09 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 456481.77 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_general.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 385 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.72 GiB (8.01 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 68.738 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 17.37 seconds per pass - ETA 4.33 minutes
+ [1]7.1376,[2]8.2054,[3]8.6103,[4]8.2935,[5]8.0625,[6]6.7858,[7]5.9786,[8]6.0365,[9]6.3174,[10]6.3803,[11]6.5135,[12]6.8626,[13]6.8769,[14]6.9592,[15]6.9638,
+ Final estimate: PPL = 6.9638 +/- 0.16907
+
+ llama_perf_context_print: load time = 8554.11 ms
+ llama_perf_context_print: prompt eval time = 150106.42 ms / 30720 tokens ( 4.89 ms per token, 204.65 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 150654.44 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14620 + ( 6002 = 5207 + 80 + 715) + 3483 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/perplexity_math.txt ADDED
@@ -0,0 +1,152 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 385 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 33.72 GiB (8.01 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 24118.57 MiB
+ load_tensors: CUDA0 model buffer size = 5207.11 MiB
+ load_tensors: CUDA1 model buffer size = 5207.11 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 43.7 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.69 seconds per pass - ETA 2.58 minutes
+ [1]2.6870,[2]2.8316,[3]3.2916,[4]3.5489,[5]4.1042,[6]4.3929,[7]4.6231,[8]4.7409,[9]4.8891,[10]5.0611,[11]5.1465,[12]5.2375,[13]5.3797,[14]5.4944,[15]5.5263,[16]5.5341,
+ Final estimate: PPL = 5.5341 +/- 0.12273
+
+ llama_perf_context_print: load time = 4608.15 ms
+ llama_perf_context_print: prompt eval time = 151699.18 ms / 32768 tokens ( 4.63 ms per token, 216.01 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 152194.46 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14627 + ( 6002 = 5207 + 80 + 715) + 3476 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17379 + ( 5481 = 5207 + 80 + 194) + 1263 |
+ llama_memory_breakdown_print: | - Host | 24484 = 24118 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-embd_q8/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/llamabench.txt ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | seed_oss 36B MXFP4 MoE | 31.50 GiB | 36.15 B | CUDA | 35 | pp8 | 21.61 ± 0.31 |
+ | seed_oss 36B MXFP4 MoE | 31.50 GiB | 36.15 B | CUDA | 35 | tg128 | 3.30 ± 0.00 |
+
+ build: 92bb442ad (7040)
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_code.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20825 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 320 tensors
+ llama_model_loader: - type q6_K: 65 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 31.50 GiB (7.48 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
+ load_tensors: CUDA0 model buffer size = 4880.16 MiB
+ load_tensors: CUDA1 model buffer size = 4880.16 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 110.832 ms
+ perplexity: calculating perplexity over 48 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.34 seconds per pass - ETA 7.47 minutes
+ [1]1.5043,[2]1.4364,[3]1.2731,[4]1.2222,[5]1.1804,[6]1.2691,[7]1.3768,[8]1.4367,[9]1.4211,[10]1.3988,[11]1.3764,[12]1.3832,[13]1.3832,[14]1.3690,[15]1.3520,[16]1.3687,[17]1.3707,[18]1.3523,[19]1.3493,[20]1.3651,[21]1.3555,[22]1.3454,[23]1.3557,[24]1.3501,[25]1.3542,[26]1.3505,[27]1.3680,[28]1.3731,[29]1.3734,[30]1.3736,[31]1.3708,[32]1.3817,[33]1.3819,[34]1.3740,[35]1.3700,[36]1.3654,[37]1.3735,[38]1.3825,[39]1.3740,[40]1.3959,[41]1.4049,[42]1.4077,[43]1.4163,[44]1.4177,[45]1.4112,[46]1.4143,[47]1.4182,[48]1.4196,
+ Final estimate: PPL = 1.4196 +/- 0.00955
+
+ llama_perf_context_print: load time = 4289.82 ms
+ llama_perf_context_print: prompt eval time = 437290.20 ms / 98304 tokens ( 4.45 ms per token, 224.80 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 438753.19 ms / 98305 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14946 + ( 5675 = 4880 + 80 + 715) + 3484 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
+ llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_general.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20825 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 320 tensors
+ llama_model_loader: - type q6_K: 65 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 31.50 GiB (7.48 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
+ load_tensors: CUDA0 model buffer size = 4880.16 MiB
+ load_tensors: CUDA1 model buffer size = 4880.16 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 48.169 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.35 seconds per pass - ETA 2.33 minutes
+ [1]7.1294,[2]8.1907,[3]8.6017,[4]8.2992,[5]8.0687,[6]6.7895,[7]5.9829,[8]6.0366,[9]6.3191,[10]6.3793,[11]6.5139,[12]6.8639,[13]6.8784,[14]6.9598,[15]6.9647,
+ Final estimate: PPL = 6.9647 +/- 0.16904
+
+ llama_perf_context_print: load time = 4633.11 ms
+ llama_perf_context_print: prompt eval time = 136636.03 ms / 30720 tokens ( 4.45 ms per token, 224.83 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 137103.47 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14951 + ( 5675 = 4880 + 80 + 715) + 3479 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
+ llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/perplexity_math.txt ADDED
@@ -0,0 +1,153 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20820 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23059 MiB free
+ llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from /mnt/world7/AI/Models/Seed-OSS-36B-Instruct-unsloth/GGUF/MXFP4/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = seed_oss
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Seed OSS 36B Instruct Unsloth
+ llama_model_loader: - kv 3: general.finetune str = Instruct-unsloth
+ llama_model_loader: - kv 4: general.basename str = Seed-OSS
+ llama_model_loader: - kv 5: general.size_label str = 36B
+ llama_model_loader: - kv 6: general.license str = apache-2.0
+ llama_model_loader: - kv 7: general.base_model.count u32 = 1
+ llama_model_loader: - kv 8: general.base_model.0.name str = Seed OSS 36B Instruct
+ llama_model_loader: - kv 9: general.base_model.0.organization str = ByteDance Seed
+ llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ByteDance-Seed...
+ llama_model_loader: - kv 11: general.tags arr[str,3] = ["vllm", "unsloth", "text-generation"]
+ llama_model_loader: - kv 12: seed_oss.block_count u32 = 64
+ llama_model_loader: - kv 13: seed_oss.context_length u32 = 524288
+ llama_model_loader: - kv 14: seed_oss.embedding_length u32 = 5120
+ llama_model_loader: - kv 15: seed_oss.feed_forward_length u32 = 27648
+ llama_model_loader: - kv 16: seed_oss.attention.head_count u32 = 80
+ llama_model_loader: - kv 17: seed_oss.attention.head_count_kv u32 = 8
+ llama_model_loader: - kv 18: seed_oss.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 19: seed_oss.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 20: seed_oss.attention.key_length u32 = 128
+ llama_model_loader: - kv 21: seed_oss.attention.value_length u32 = 128
+ llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 23: tokenizer.ggml.pre str = seed-coder
+ llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,155136] = ["<seed:bos>", "<seed:pad>", "<seed:e...
+ llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,155136] = [3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, ...
+ llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,154737] = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
+ llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 0
+ llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
+ llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 1
+ llama_model_loader: - kv 30: tokenizer.chat_template str = {# Unsloth Chat template fixes #}\n{# ...
+ llama_model_loader: - kv 31: general.quantization_version u32 = 2
+ llama_model_loader: - kv 32: general.file_type u32 = 38
+ llama_model_loader: - type f32: 321 tensors
+ llama_model_loader: - type q8_0: 320 tensors
+ llama_model_loader: - type q6_K: 65 tensors
+ llama_model_loader: - type mxfp4: 65 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = MXFP4 MoE
+ print_info: file size = 31.50 GiB (7.48 BPW)
+ load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
+ load: printing all EOG tokens:
+ load: - 2 ('<seed:eos>')
+ load: special tokens cache size = 128
+ load: token to piece cache size = 0.9296 MB
+ print_info: arch = seed_oss
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 524288
+ print_info: n_embd = 5120
+ print_info: n_embd_inp = 5120
+ print_info: n_layer = 64
+ print_info: n_head = 80
+ print_info: n_head_kv = 8
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 10
+ print_info: n_embd_k_gqa = 1024
+ print_info: n_embd_v_gqa = 1024
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 27648
+ print_info: n_expert = 0
+ print_info: n_expert_used = 0
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 524288
+ print_info: rope_finetuned = unknown
+ print_info: model type = 36B
+ print_info: model params = 36.15 B
+ print_info: general.name = Seed OSS 36B Instruct Unsloth
+ print_info: vocab type = BPE
+ print_info: n_vocab = 155136
+ print_info: n_merges = 154737
+ print_info: BOS token = 0 '<seed:bos>'
+ print_info: EOS token = 2 '<seed:eos>'
+ print_info: PAD token = 1 '<seed:pad>'
+ print_info: LF token = 326 'Ċ'
+ print_info: EOG token = 2 '<seed:eos>'
+ print_info: max token length = 1024
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/65 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 22496.52 MiB
+ load_tensors: CUDA0 model buffer size = 4880.16 MiB
+ load_tensors: CUDA1 model buffer size = 4880.16 MiB
+ ...................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (524288) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.59 MiB
+ llama_kv_cache: CPU KV buffer size = 352.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 80.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 80.00 MiB
+ llama_kv_cache: size = 512.00 MiB ( 2048 cells, 64 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 715.42 MiB
+ llama_context: CUDA1 compute buffer size = 194.01 MiB
+ llama_context: CUDA_Host compute buffer size = 14.01 MiB
+ llama_context: graph nodes = 2183
+ llama_context: graph splits = 621 (with bs=512), 4 (with bs=1)
+ common_init_from_params: added <seed:eos> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 45.232 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 9.33 seconds per pass - ETA 2.48 minutes
+ [1]2.6803,[2]2.8322,[3]3.2930,[4]3.5483,[5]4.1021,[6]4.3899,[7]4.6199,[8]4.7387,[9]4.8877,[10]5.0587,[11]5.1451,[12]5.2368,[13]5.3790,[14]5.4929,[15]5.5253,[16]5.5326,
+ Final estimate: PPL = 5.5326 +/- 0.12270
+
+ llama_perf_context_print: load time = 4294.54 ms
+ llama_perf_context_print: prompt eval time = 145679.92 ms / 32768 tokens ( 4.45 ms per token, 224.93 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 146170.11 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24107 = 14946 + ( 5675 = 4880 + 80 + 715) + 3484 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 17705 + ( 5154 = 4880 + 80 + 194) + 1264 |
+ llama_memory_breakdown_print: | - Host | 22862 = 22496 + 352 + 14 |
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_code.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_general.txt ADDED
The diff for this file is too large to render. See raw diff
 
Benchmarks/Seed-OSS-36B-Instruct-unsloth-MXFP4_MOE-output_mxfp4-router_gate_emb_q6_K/ppl_corpus_math.txt ADDED
The diff for this file is too large to render. See raw diff