---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct
tags:
- draft
- speculative-decoding
---

A `0.4B` parameter draft (speculative decoding) model for use with [Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411) and [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407).

See [Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF](https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF) for the models in `gguf` format for use with `llama.cpp`.
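
If you are using the `safetensors` version with `transformers`, the draft can be passed as the `assistant_model` for assisted (speculative) decoding. A minimal sketch is below; the draft repo id is an assumption based on this card's naming, and the target model obviously needs suitable hardware:

```python
# Minimal sketch of assisted ("speculative") decoding with transformers.
# Repo ids are assumptions; adjust them to your local setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "mistralai/Mistral-Large-Instruct-2411"
draft_id = "jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

# The draft model uses the target's 32768-token vocabulary (see the transplant
# log below), so it can act directly as the assistant model.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about speculative decoding."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(target.device)

output_ids = target.generate(input_ids, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```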

---

# Extending the context above 32k

The current `config.json` is set for a context length of up to 32k tokens. To enable [YaRN](https://arxiv.org/abs/2309.00071) for longer contexts, add a `"rope_scaling"` section to `config.json`, eg:

## To extend the context to 64k:

```json
  "max_position_embeddings": 65536,
  ...
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },
```

## To extend the context to 128k:

```json
  "max_position_embeddings": 131072,
  ...
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },
```

**NOTE**: Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the `rope_scaling` configuration when you actually need to process long contexts...
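
If you would rather script the edit than patch `config.json` by hand, a minimal sketch is below (the model folder name is an assumption; point it at your local copy):

```python
# Minimal sketch: add the YaRN "rope_scaling" block to config.json for 64k context.
# The model folder name is an assumption; adjust it to your local path.
import json
from pathlib import Path

config_path = Path("Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0/config.json")
config = json.loads(config_path.read_text())

config["max_position_embeddings"] = 65536       # or 131072 for 128k
config["rope_scaling"] = {
    "factor": 2.0,                              # or 4.0 for 128k (65536 / 32768 = 2)
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

config_path.write_text(json.dumps(config, indent=2) + "\n")
```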

---

# How this model was created

## 1. The initial model was created from [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
> python ./transplant_vocab.py \
    ./Qwen2.5-0.5B-Instruct \
    ./Mistral-Large-Instruct-2411 \
    ./Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED \
    --override "<unk>" "<|endoftext|>" \
    --override "<s>" "<|endoftext|>" \
    --override "</s>" "<|im_end|>" \
    --override "[INST]" "<|im_start|>user\n" \
    --override "[/INST]" "<|im_end|><|im_start|>assistant\n" \
    --override "[TOOL_CALLS]" "<tool_call>" \
    --override "[AVAILABLE_TOOLS]" "<tools>" \
    --override "[/AVAILABLE_TOOLS]" "</tools>" \
    --override "[TOOL_RESULTS]" "<tool_response>" \
    --override "[/TOOL_RESULTS]" "</tool_response>" \
    --override "[IMG]" "<|vision_start|>" \
    --override "[PREFIX]" "<|fim_prefix|>" \
    --override "[MIDDLE]" "<|fim_middle|>" \
    --override "[SUFFIX]" "<|fim_suffix|>" \
    --override "[IMG_BREAK]" "<|vision_pad|>" \
    --override "[IMG_END]" "<|vision_end|>" \
    --override "[SYSTEM_PROMPT]" "<|im_start|>system\n" \
    --override "[/SYSTEM_PROMPT]" "<|im_end|>" \
    --override "[TOOL_CONTENT]" "<tool_response>"

Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'Mistral-Large-Instruct-2411'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'Mistral-Large-Instruct-2411'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'...

Input model configuration:
- Target vocabulary size    : 32768 (used = 32768, unused = 0)
- Donor vocabulary size     : 151936
- Donor num layers          : 24 (tied embeddings = True)
- Donor hidden size         : 896
- Donor attention heads     : 14
- Donor intermediate size   : 4864 (ratio = 1:5.4)
- Donor total parameters    : 494032768 (0.49B)
-- Embedding parameters     : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)

Processing 3 automatic token overrides:
✔ 'bos_token_id' : 1 '<s>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 2 '</s>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : Not found for target model

Processing 19 manual token overrides:
✔ 0 : '<unk>' → [151643] '<|endoftext|>'
✔ 1 : '<s>' → [151643] '<|endoftext|>'
✔ 2 : '</s>' → [151645] '<|im_end|>'
✔ 3 : '[INST]' → [151644, 872, 198] '<|im_start|>user\n'
✔ 4 : '[/INST]' → [151645, 151644, 77091, 198] '<|im_end|><|im_start|>assistant\n'
✔ 5 : '[TOOL_CALLS]' → [151657] '<tool_call>'
✔ 6 : '[AVAILABLE_TOOLS]' → [27, 15918, 29] '<tools>'
✔ 7 : '[/AVAILABLE_TOOLS]' → [522, 15918, 29] '</tools>'
✔ 8 : '[TOOL_RESULTS]' → [27, 14172, 9655, 29] '<tool_response>'
✔ 9 : '[/TOOL_RESULTS]' → [522, 14172, 9655, 29] '</tool_response>'
✔ 10 : '[IMG]' → [151652] '<|vision_start|>'
✔ 11 : '[PREFIX]' → [151659] '<|fim_prefix|>'
✔ 12 : '[MIDDLE]' → [151660] '<|fim_middle|>'
✔ 13 : '[SUFFIX]' → [151661] '<|fim_suffix|>'
✔ 14 : '[IMG_BREAK]' → [151654] '<|vision_pad|>'
✔ 15 : '[IMG_END]' → [151653] '<|vision_end|>'
✔ 16 : '[SYSTEM_PROMPT]' → [151644, 8948, 198] '<|im_start|>system\n'
✔ 17 : '[/SYSTEM_PROMPT]' → [151645] '<|im_end|>'
✔ 18 : '[TOOL_CONTENT]' → [27, 14172, 9655, 29] '<tool_response>'

NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...

Transplanting tokens: 100%|██████████████████████████████████████████████████| 32768/32768 [00:09<00:00, 3311.13token/s]

Transplant mappings:
- 1 to 1 : 29370 (90%)
- 2 to 1 : 2445 (7.5%)
- 3 to 1 : 170 (0.52%)
- 4 to 1 : 29 (0.089%)
- 5 to 1 : 3 (0.0092%)
- 6 to 1 : 93 (0.28%)
- 7 to 1 : 658 (2%)

Head initialized with:
- Copies : 29370 (90%)
- Means  : 3398 (10%)
- Zeros  : 0 (0%)

Output model configuration:
- Output vocabulary size    : 32768
- Output num layers         : 24 (tied embeddings = False)
- Output hidden size        : 896
- Output attention heads    : 14
- Output intermediate size  : 4864 (ratio = 1:5.4)
- Output total parameters   : 416618368 (0.42B)
-- Embedding parameters     : 58720256 (0.06B)
-- Non-embedding parameters : 357898112 (0.36B)

Saving model and tokenizer to 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED' folder

Patching 'torch_dtype' in 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype

Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
```
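
For intuition, the transplant step above maps each of Mistral-Large's 32768 token strings onto the donor (Qwen2.5) tokenizer and initialises the new embedding/head rows from the donor's embeddings. The sketch below is a rough conceptual illustration of that idea only; it is **not** the actual `transplant-vocab` code, and the function name is made up:

```python
# Conceptual sketch of transplanted-embedding initialisation:
# - 1-to-1 token matches copy the donor embedding row ("Copies" in the log above),
# - n-to-1 matches average the donor rows ("Means" in the log above),
# - anything unmappable is left at zero ("Zeros" in the log above).
import torch

def init_transplanted_embeddings(target_tokenizer, donor_tokenizer, donor_embed: torch.Tensor) -> torch.Tensor:
    new_embed = torch.zeros(len(target_tokenizer), donor_embed.shape[1], dtype=donor_embed.dtype)
    for token_id in range(len(target_tokenizer)):
        text = target_tokenizer.decode([token_id])
        donor_ids = donor_tokenizer.encode(text, add_special_tokens=False)
        if len(donor_ids) == 1:
            new_embed[token_id] = donor_embed[donor_ids[0]]           # copy
        elif len(donor_ids) > 1:
            new_embed[token_id] = donor_embed[donor_ids].mean(dim=0)  # mean of the pieces
    return new_embed
```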

## 2. The following datasets were used to create a fine-tuning dataset of ~2.8B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct) (NOTE: `output` field only)

with each sample formatted just between `</s>` tags (see the rough sketch below).
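
As a rough illustration only (the `"text"` field name, output file name, and split are assumptions, not taken from this card or from `qlora-pipe-lite`):

```python
# Rough illustration of the dataset preparation described above.
# The "text" field name and on-disk file layout are assumptions.
import json
from datasets import load_dataset

ds = load_dataset("rombodawg/Everything_Instruct", split="train")

with open("datasets/rombodawg-Everything-Instruct/train.json", "w") as f:
    for row in ds:
        # Keep only the `output` field, delimited by a `</s>` tag.
        f.write(json.dumps({"text": row["output"] + "</s>"}) + "\n")
```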

## 3. The model was then trained using [qlora-pipe-lite](https://github.com/jukofyork/qlora-pipe-lite) for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step):

```toml
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================

model_dir = 'models/Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED'
output_dir = 'finetuned'

# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================

full_fine_tune = true

# =======================
# OPTIMIZER CONFIGURATION
# =======================

lr = 5e-5

# ======================
# TRAINING CONFIGURATION
# ======================

sequence_len = 32768

gradient_accumulation_steps = 10  # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step

# =====================
# DATASET CONFIGURATION
# =====================

drop_tails = true

[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'

[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'

[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
```

I used six `RTX A6000` GPUs across three nodes, hence the batch size of `60` (`6 GPUs × 10 gradient accumulation steps = 60`).
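
The effective batch size and per-step token count quoted above come directly from this arithmetic:

```python
# Effective batch size and tokens per optimiser step for the run above.
gpus = 6                  # six RTX A6000s across three nodes
grad_accum_steps = 10     # gradient_accumulation_steps in the TOML config
sequence_len = 32768

batch_size = gpus * grad_accum_steps         # 60 sequences per optimiser step
tokens_per_step = batch_size * sequence_len  # 1,966,080 (~2M) tokens per step
print(batch_size, tokens_per_step)
```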

![image](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/uARplwAxoskC3XKPYKBeg.png)