---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct
tags:
- draft
- speculative-decoding
---

A `0.4B` parameter draft (speculative decoding) model for use with [Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411) and [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407).

See [Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF](https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF) for the models in `gguf` format for use with `llama.cpp`.
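
If you are using the `safetensors` version with `transformers`, the draft can be passed as the `assistant_model` for assisted (speculative) decoding. A minimal sketch is below; the draft repo id is an assumption based on this card's naming, and the target model obviously needs suitable hardware:

```python
# Minimal sketch of assisted ("speculative") decoding with transformers.
# Repo ids are assumptions; adjust them to your local setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "mistralai/Mistral-Large-Instruct-2411"
draft_id = "jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

# The draft model uses the target's 32768-token vocabulary (see the transplant
# log below), so it can act directly as the assistant model.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about speculative decoding."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(target.device)

output_ids = target.generate(input_ids, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```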

---

# Extending the context above 32k

The current `config.json` is set for a context length of up to 32k tokens. To enable [YaRN](https://arxiv.org/abs/2309.00071) for longer contexts, add a `"rope_scaling"` section to `config.json`, eg:

## To extend the context to 64k:

```json
  "max_position_embeddings": 65536,
  ...
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },
```

## To extend the context to 128k:

```json
  "max_position_embeddings": 131072,
  ...
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },
```

**NOTE**: Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the `rope_scaling` configuration when you actually need to process long contexts...
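
If you would rather script the edit than patch `config.json` by hand, a minimal sketch is below (the model folder name is an assumption; point it at your local copy):

```python
# Minimal sketch: add the YaRN "rope_scaling" block to config.json for 64k context.
# The model folder name is an assumption; adjust it to your local path.
import json
from pathlib import Path

config_path = Path("Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0/config.json")
config = json.loads(config_path.read_text())

config["max_position_embeddings"] = 65536       # or 131072 for 128k
config["rope_scaling"] = {
    "factor": 2.0,                              # or 4.0 for 128k (65536 / 32768 = 2)
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

config_path.write_text(json.dumps(config, indent=2) + "\n")
```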

---

# How this model was created

## 1. The initial model was created from [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
> python ./transplant_vocab.py \
    ./Qwen2.5-0.5B-Instruct \
    ./Mistral-Large-Instruct-2411 \
    ./Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED \
    --override "<unk>" "<|endoftext|>" \
    --override "<s>" "<|endoftext|>" \
    --override "</s>" "<|im_end|>" \
    --override "[INST]" "<|im_start|>user\n" \
    --override "[/INST]" "<|im_end|><|im_start|>assistant\n" \
    --override "[TOOL_CALLS]" "<tool_call>" \
    --override "[AVAILABLE_TOOLS]" "<tools>" \
    --override "[/AVAILABLE_TOOLS]" "</tools>" \
    --override "[TOOL_RESULTS]" "<tool_response>" \
    --override "[/TOOL_RESULTS]" "</tool_response>" \
    --override "[IMG]" "<|vision_start|>" \
    --override "[PREFIX]" "<|fim_prefix|>" \
    --override "[MIDDLE]" "<|fim_middle|>" \
    --override "[SUFFIX]" "<|fim_suffix|>" \
    --override "[IMG_BREAK]" "<|vision_pad|>" \
    --override "[IMG_END]" "<|vision_end|>" \
    --override "[SYSTEM_PROMPT]" "<|im_start|>system\n" \
    --override "[/SYSTEM_PROMPT]" "<|im_end|>" \
    --override "[TOOL_CONTENT]" "<tool_response>"

Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'Mistral-Large-Instruct-2411'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'Mistral-Large-Instruct-2411'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'...

Input model configuration:
- Target vocabulary size    : 32768 (used = 32768, unused = 0)
- Donor vocabulary size     : 151936
- Donor num layers          : 24 (tied embeddings = True)
- Donor hidden size         : 896
- Donor attention heads     : 14
- Donor intermediate size   : 4864 (ratio = 1:5.4)
- Donor total parameters    : 494032768 (0.49B)
-- Embedding parameters     : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)

Processing 3 automatic token overrides:
✔ 'bos_token_id' : 1 '<s>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 2 '</s>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : Not found for target model

Processing 19 manual token overrides:
✔ 0 : '<unk>' → [151643] '<|endoftext|>'
✔ 1 : '<s>' → [151643] '<|endoftext|>'
✔ 2 : '</s>' → [151645] '<|im_end|>'
✔ 3 : '[INST]' → [151644, 872, 198] '<|im_start|>user\n'
✔ 4 : '[/INST]' → [151645, 151644, 77091, 198] '<|im_end|><|im_start|>assistant\n'
✔ 5 : '[TOOL_CALLS]' → [151657] '<tool_call>'
✔ 6 : '[AVAILABLE_TOOLS]' → [27, 15918, 29] '<tools>'
✔ 7 : '[/AVAILABLE_TOOLS]' → [522, 15918, 29] '</tools>'
✔ 8 : '[TOOL_RESULTS]' → [27, 14172, 9655, 29] '<tool_response>'
✔ 9 : '[/TOOL_RESULTS]' → [522, 14172, 9655, 29] '</tool_response>'
✔ 10 : '[IMG]' → [151652] '<|vision_start|>'
✔ 11 : '[PREFIX]' → [151659] '<|fim_prefix|>'
✔ 12 : '[MIDDLE]' → [151660] '<|fim_middle|>'
✔ 13 : '[SUFFIX]' → [151661] '<|fim_suffix|>'
✔ 14 : '[IMG_BREAK]' → [151654] '<|vision_pad|>'
✔ 15 : '[IMG_END]' → [151653] '<|vision_end|>'
✔ 16 : '[SYSTEM_PROMPT]' → [151644, 8948, 198] '<|im_start|>system\n'
✔ 17 : '[/SYSTEM_PROMPT]' → [151645] '<|im_end|>'
✔ 18 : '[TOOL_CONTENT]' → [27, 14172, 9655, 29] '<tool_response>'

NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...

Transplanting tokens: 100%|██████████████████████████████████████████████████| 32768/32768 [00:09<00:00, 3311.13token/s]

Transplant mappings:
- 1 to 1 : 29370 (90%)
- 2 to 1 : 2445 (7.5%)
- 3 to 1 : 170 (0.52%)
- 4 to 1 : 29 (0.089%)
- 5 to 1 : 3 (0.0092%)
- 6 to 1 : 93 (0.28%)
- 7 to 1 : 658 (2%)

Head initialized with:
- Copies : 29370 (90%)
- Means  : 3398 (10%)
- Zeros  : 0 (0%)

Output model configuration:
- Output vocabulary size    : 32768
- Output num layers         : 24 (tied embeddings = False)
- Output hidden size        : 896
- Output attention heads    : 14
- Output intermediate size  : 4864 (ratio = 1:5.4)
- Output total parameters   : 416618368 (0.42B)
-- Embedding parameters     : 58720256 (0.06B)
-- Non-embedding parameters : 357898112 (0.36B)

Saving model and tokenizer to 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED' folder

Patching 'torch_dtype' in 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype

Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
```
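
For intuition, the transplant step above maps each of Mistral-Large's 32768 token strings onto the donor (Qwen2.5) tokenizer and initialises the new embedding/head rows from the donor's embeddings. The sketch below is a rough conceptual illustration of that idea only; it is **not** the actual `transplant-vocab` code, and the function name is made up:

```python
# Conceptual sketch of transplanted-embedding initialisation:
# - 1-to-1 token matches copy the donor embedding row ("Copies" in the log above),
# - n-to-1 matches average the donor rows ("Means" in the log above),
# - anything unmappable is left at zero ("Zeros" in the log above).
import torch

def init_transplanted_embeddings(target_tokenizer, donor_tokenizer, donor_embed: torch.Tensor) -> torch.Tensor:
    new_embed = torch.zeros(len(target_tokenizer), donor_embed.shape[1], dtype=donor_embed.dtype)
    for token_id in range(len(target_tokenizer)):
        text = target_tokenizer.decode([token_id])
        donor_ids = donor_tokenizer.encode(text, add_special_tokens=False)
        if len(donor_ids) == 1:
            new_embed[token_id] = donor_embed[donor_ids[0]]           # copy
        elif len(donor_ids) > 1:
            new_embed[token_id] = donor_embed[donor_ids].mean(dim=0)  # mean of the pieces
    return new_embed
```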

## 2. The following datasets were used to create a fine-tuning dataset of ~2.8B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct) (NOTE: `output` field only)

with each sample formatted just between `</s>` tags (see the rough sketch below).
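
As a rough illustration only (the `"text"` field name, output file name, and split are assumptions, not taken from this card or from `qlora-pipe-lite`):

```python
# Rough illustration of the dataset preparation described above.
# The "text" field name and on-disk file layout are assumptions.
import json
from datasets import load_dataset

ds = load_dataset("rombodawg/Everything_Instruct", split="train")

with open("datasets/rombodawg-Everything-Instruct/train.json", "w") as f:
    for row in ds:
        # Keep only the `output` field, delimited by a `</s>` tag.
        f.write(json.dumps({"text": row["output"] + "</s>"}) + "\n")
```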

## 3. The model was then trained using [qlora-pipe-lite](https://github.com/jukofyork/qlora-pipe-lite) for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step):

```toml
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================

model_dir = 'models/Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED'
output_dir = 'finetuned'

# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================

full_fine_tune = true

# =======================
# OPTIMIZER CONFIGURATION
# =======================

lr = 5e-5

# ======================
# TRAINING CONFIGURATION
# ======================

sequence_len = 32768

gradient_accumulation_steps = 10  # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step

# =====================
# DATASET CONFIGURATION
# =====================

drop_tails = true

[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'

[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'

[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
```

I used six `RTX A6000` GPUs across three nodes, hence the batch size of `60` (`6 GPUs × 10 gradient accumulation steps = 60`).
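
The effective batch size and per-step token count quoted above come directly from this arithmetic:

```python
# Effective batch size and tokens per optimiser step for the run above.
gpus = 6                  # six RTX A6000s across three nodes
grad_accum_steps = 10     # gradient_accumulation_steps in the TOML config
sequence_len = 32768

batch_size = gpus * grad_accum_steps         # 60 sequences per optimiser step
tokens_per_step = batch_size * sequence_len  # 1,966,080 (~2M) tokens per step
print(batch_size, tokens_per_step)
```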

![image](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/uARplwAxoskC3XKPYKBeg.png)