---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct
tags:
- draft
- speculative-decoding
---

A `0.4B` parameter draft (speculative decoding) model for use with [Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411) and [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407).

See [Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF](https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF) for the models in `gguf` format for use with `llama.cpp`.
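
As a rough illustration, a draft model like this is typically loaded alongside the main model via `llama-server`'s `--model-draft`/`-md` option. The `gguf` filenames below are placeholders, and the exact speculative-decoding flags can differ between `llama.cpp` builds, so check `llama-server --help` for your version:

```sh
# Hypothetical invocation: the small draft model proposes tokens which the
# large main model then verifies in parallel. Substitute your actual gguf files.
llama-server \
    -m  Mistral-Large-Instruct-2411-Q4_K_M.gguf \
    -md Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-Q4_0.gguf \
    --draft-max 16 --draft-min 4 \
    -c 32768
```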

---

# Extending the context above 32k

The current `config.json` is set for a context length of up to 32k tokens. Add the `"rope_scaling"` section to `config.json` to enable [YaRN](https://arxiv.org/abs/2309.00071), e.g.:

## To extend the context to 64k:

```json
"max_position_embeddings": 65536,
...
"rope_scaling": {
  "factor": 2.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},
```

## To extend the context to 128k:

```json
"max_position_embeddings": 131072,
...
"rope_scaling": {
  "factor": 4.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},
```

**NOTE**: Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the `rope_scaling` configuration when processing long contexts is required...
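
If you patch `config.json` like this and want to regenerate the `gguf` files yourself, the conversion with `llama.cpp`'s `convert_hf_to_gguf.py` might look roughly like the following (the folder and output names are placeholders, and the available options can vary between `llama.cpp` versions):

```sh
# Hypothetical sketch: convert the YaRN-patched draft model folder to gguf.
# Run from inside a llama.cpp checkout; pick whichever --outtype you prefer.
python convert_hf_to_gguf.py ./Mistral-Large-Instruct-2411-DRAFT-0.4B \
    --outfile Mistral-Large-Instruct-2411-DRAFT-0.4B-128k-bf16.gguf \
    --outtype bf16
```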

---

# How this model was created

## 1. The initial model was created from [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
> python ./transplant_vocab.py \
    ./Qwen2.5-0.5B-Instruct \
    ./Mistral-Large-Instruct-2411 \
    ./Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED \
    --override "<unk>" "<|endoftext|>" \
    --override "<s>" "<|endoftext|>" \
    --override "</s>" "<|im_end|>" \
    --override "[INST]" "<|im_start|>user\n" \
    --override "[/INST]" "<|im_end|><|im_start|>assistant\n" \
    --override "[TOOL_CALLS]" "<tool_call>" \
    --override "[AVAILABLE_TOOLS]" "<tools>" \
    --override "[/AVAILABLE_TOOLS]" "</tools>" \
    --override "[TOOL_RESULTS]" "<tool_response>" \
    --override "[/TOOL_RESULTS]" "</tool_response>" \
    --override "[IMG]" "<|vision_start|>" \
    --override "[PREFIX]" "<|fim_prefix|>" \
    --override "[MIDDLE]" "<|fim_middle|>" \
    --override "[SUFFIX]" "<|fim_suffix|>" \
    --override "[IMG_BREAK]" "<|vision_pad|>" \
    --override "[IMG_END]" "<|vision_end|>" \
    --override "[SYSTEM_PROMPT]" "<|im_start|>system\n" \
    --override "[/SYSTEM_PROMPT]" "<|im_end|>" \
    --override "[TOOL_CONTENT]" "<tool_response>"

Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'Mistral-Large-Instruct-2411'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'Mistral-Large-Instruct-2411'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'...

Input model configuration:
- Target vocabulary size    : 32768 (used = 32768, unused = 0)
- Donor vocabulary size     : 151936
- Donor num layers          : 24 (tied embeddings = True)
- Donor hidden size         : 896
- Donor attention heads     : 14
- Donor intermediate size   : 4864 (ratio = 1:5.4)
- Donor total parameters    : 494032768 (0.49B)
-- Embedding parameters     : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)

Processing 3 automatic token overrides:
✔ 'bos_token_id' : 1 '<s>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 2 '</s>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : Not found for target model

Processing 19 manual token overrides:
✔ 0 : '<unk>' → [151643] '<|endoftext|>'
✔ 1 : '<s>' → [151643] '<|endoftext|>'
✔ 2 : '</s>' → [151645] '<|im_end|>'
✔ 3 : '[INST]' → [151644, 872, 198] '<|im_start|>user\n'
✔ 4 : '[/INST]' → [151645, 151644, 77091, 198] '<|im_end|><|im_start|>assistant\n'
✔ 5 : '[TOOL_CALLS]' → [151657] '<tool_call>'
✔ 6 : '[AVAILABLE_TOOLS]' → [27, 15918, 29] '<tools>'
✔ 7 : '[/AVAILABLE_TOOLS]' → [522, 15918, 29] '</tools>'
✔ 8 : '[TOOL_RESULTS]' → [27, 14172, 9655, 29] '<tool_response>'
✔ 9 : '[/TOOL_RESULTS]' → [522, 14172, 9655, 29] '</tool_response>'
✔ 10 : '[IMG]' → [151652] '<|vision_start|>'
✔ 11 : '[PREFIX]' → [151659] '<|fim_prefix|>'
✔ 12 : '[MIDDLE]' → [151660] '<|fim_middle|>'
✔ 13 : '[SUFFIX]' → [151661] '<|fim_suffix|>'
✔ 14 : '[IMG_BREAK]' → [151654] '<|vision_pad|>'
✔ 15 : '[IMG_END]' → [151653] '<|vision_end|>'
✔ 16 : '[SYSTEM_PROMPT]' → [151644, 8948, 198] '<|im_start|>system\n'
✔ 17 : '[/SYSTEM_PROMPT]' → [151645] '<|im_end|>'
✔ 18 : '[TOOL_CONTENT]' → [27, 14172, 9655, 29] '<tool_response>'

NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...

Transplanting tokens: 100%|████████████████████████████████████████████████████████████| 32768/32768 [00:09<00:00, 3311.13token/s]

Transplant mappings:
- 1 to 1 : 29370 (90%)
- 2 to 1 : 2445 (7.5%)
- 3 to 1 : 170 (0.52%)
- 4 to 1 : 29 (0.089%)
- 5 to 1 : 3 (0.0092%)
- 6 to 1 : 93 (0.28%)
- 7 to 1 : 658 (2%)

Head initialized with:
- Copies : 29370 (90%)
- Means  : 3398 (10%)
- Zeros  : 0 (0%)

Output model configuration:
- Output vocabulary size    : 32768
- Output num layers         : 24 (tied embeddings = False)
- Output hidden size        : 896
- Output attention heads    : 14
- Output intermediate size  : 4864 (ratio = 1:5.4)
- Output total parameters   : 416618368 (0.42B)
-- Embedding parameters     : 58720256 (0.06B)
-- Non-embedding parameters : 357898112 (0.36B)

Saving model and tokenizer to 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED' folder

Patching 'torch_dtype' in 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype

Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
```

## 2. The following datasets were used to create a fine-tuning dataset of ~2.8B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct) (NOTE: `output` field only)

with each sample formatted just between `</s>` tags.
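
As a purely illustrative sketch of that formatting (not the actual preprocessing script; it assumes the data has been exported to JSON Lines with an `output` field), the idea is just to dump the raw text with a `</s>` tag terminating each sample:

```sh
# Hypothetical sketch only: emit each sample's `output` field followed by `</s>`,
# so consecutive samples end up separated by `</s>` in the plain-text dump.
jq -r '.output + "</s>"' everything_instruct.jsonl > everything_instruct.txt
```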

## 3. The model was then trained using [qlora-pipe-lite](https://github.com/jukofyork/qlora-pipe-lite) for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step):

```toml
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================

model_dir = 'models/Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED'
output_dir = 'finetuned'

# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================

full_fine_tune = true

# =======================
# OPTIMIZER CONFIGURATION
# =======================

lr = 5e-5

# ======================
# TRAINING CONFIGURATION
# ======================

sequence_len = 32768

gradient_accumulation_steps = 10  # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step

# =====================
# DATASET CONFIGURATION
# =====================

drop_tails = true

[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'

[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'

[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
```

I used six `RTX A6000` GPUs across three nodes, hence the batch size of `60` (`6 GPUs × 10 gradient accumulation steps = 60`).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/sCcooSnGHRz1cq9T8pTfq.png)