---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct
tags:
- draft
- speculative-decoding
---

A `0.4B` parameter draft (speculative decoding) model for use with [Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411) and [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407).

See [Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF](https://huggingface.co/jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0-GGUF) for the models in `gguf` format for use with `llama.cpp`.

---

# Extending the context above 32k

The current `config.json` is set for a context length of up to 32k tokens. Add a `"rope_scaling"` section to `config.json` to enable [YaRN](https://arxiv.org/abs/2309.00071), e.g.:

## To extend the context to 64k:

```json
"max_position_embeddings": 65536,
...
"rope_scaling": {
  "factor": 2.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},
```

## To extend the context to 128k:

```json
"max_position_embeddings": 131072,
...
"rope_scaling": {
  "factor": 4.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},
```

**NOTE**: Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the `rope_scaling` configuration when processing long contexts is required...
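If you prefer to script this change rather than edit `config.json` by hand, a minimal sketch along the following lines should work. The file path and the `FACTOR` constant are placeholders; the values simply mirror the snippets above.

```python
import json

# Path to your local copy of the draft model (adjust to your download location).
CONFIG_PATH = "Mistral-Large-Instruct-2411-DRAFT-0.4B/config.json"

# factor 2.0 -> 64k context, factor 4.0 -> 128k context (see above).
FACTOR = 2.0

with open(CONFIG_PATH) as f:
    config = json.load(f)

# Scale the maximum context and add the static-YaRN rope_scaling block.
config["max_position_embeddings"] = int(32768 * FACTOR)
config["rope_scaling"] = {
    "factor": FACTOR,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```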
---

# How this model was created

## 1. The initial model was created from [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
> python ./transplant_vocab.py \
    ./Qwen2.5-0.5B-Instruct \
    ./Mistral-Large-Instruct-2411 \
    ./Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED \
    --override "<unk>" "<|endoftext|>" \
    --override "<s>" "<|endoftext|>" \
    --override "</s>" "<|im_end|>" \
    --override "[INST]" "<|im_start|>user\n" \
    --override "[/INST]" "<|im_end|><|im_start|>assistant\n" \
    --override "[TOOL_CALLS]" "<tool_call>" \
    --override "[AVAILABLE_TOOLS]" "" \
    --override "[/AVAILABLE_TOOLS]" "" \
    --override "[TOOL_RESULTS]" "" \
    --override "[/TOOL_RESULTS]" "" \
    --override "[IMG]" "<|vision_start|>" \
    --override "[PREFIX]" "<|fim_prefix|>" \
    --override "[MIDDLE]" "<|fim_middle|>" \
    --override "[SUFFIX]" "<|fim_suffix|>" \
    --override "[IMG_BREAK]" "<|vision_pad|>" \
    --override "[IMG_END]" "<|vision_end|>" \
    --override "[SYSTEM_PROMPT]" "<|im_start|>system\n" \
    --override "[/SYSTEM_PROMPT]" "<|im_end|>" \
    --override "[TOOL_CONTENT]" ""

Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'Mistral-Large-Instruct-2411'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'Mistral-Large-Instruct-2411'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'...

Input model configuration:
- Target vocabulary size    : 32768 (used = 32768, unused = 0)
- Donor vocabulary size     : 151936
- Donor num layers          : 24 (tied embeddings = True)
- Donor hidden size         : 896
- Donor attention heads     : 14
- Donor intermediate size   : 4864 (ratio = 1:5.4)
- Donor total parameters    : 494032768 (0.49B)
-- Embedding parameters     : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)

Processing 3 automatic token overrides:
✔ 'bos_token_id' : 1 '<s>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 2 '</s>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : Not found for target model

Processing 19 manual token overrides:
✔    0 : '<unk>'              → [151643] '<|endoftext|>'
✔    1 : '<s>'                → [151643] '<|endoftext|>'
✔    2 : '</s>'               → [151645] '<|im_end|>'
✔    3 : '[INST]'             → [151644, 872, 198] '<|im_start|>user\n'
✔    4 : '[/INST]'            → [151645, 151644, 77091, 198] '<|im_end|><|im_start|>assistant\n'
✔    5 : '[TOOL_CALLS]'       → [151657] '<tool_call>'
✔    6 : '[AVAILABLE_TOOLS]'  → [27, 15918, 29] ''
✔    7 : '[/AVAILABLE_TOOLS]' → [522, 15918, 29] ''
✔    8 : '[TOOL_RESULTS]'     → [27, 14172, 9655, 29] ''
✔    9 : '[/TOOL_RESULTS]'    → [522, 14172, 9655, 29] ''
✔   10 : '[IMG]'              → [151652] '<|vision_start|>'
✔   11 : '[PREFIX]'           → [151659] '<|fim_prefix|>'
✔   12 : '[MIDDLE]'           → [151660] '<|fim_middle|>'
✔   13 : '[SUFFIX]'           → [151661] '<|fim_suffix|>'
✔   14 : '[IMG_BREAK]'        → [151654] '<|vision_pad|>'
✔   15 : '[IMG_END]'          → [151653] '<|vision_end|>'
✔   16 : '[SYSTEM_PROMPT]'    → [151644, 8948, 198] '<|im_start|>system\n'
✔   17 : '[/SYSTEM_PROMPT]'   → [151645] '<|im_end|>'
✔   18 : '[TOOL_CONTENT]'     → [27, 14172, 9655, 29] ''

NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...

Transplanting tokens: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32768/32768 [00:09<00:00, 3311.13token/s]

Transplant mappings:
- 1 to 1 : 29370 (90%)
- 2 to 1 : 2445 (7.5%)
- 3 to 1 : 170 (0.52%)
- 4 to 1 : 29 (0.089%)
- 5 to 1 : 3 (0.0092%)
- 6 to 1 : 93 (0.28%)
- 7 to 1 : 658 (2%)

Head initialized with:
- Copies : 29370 (90%)
- Means  : 3398 (10%)
- Zeros  : 0 (0%)

Output model configuration:
- Output vocabulary size    : 32768
- Output num layers         : 24 (tied embeddings = False)
- Output hidden size        : 896
- Output attention heads    : 14
- Output intermediate size  : 4864 (ratio = 1:5.4)
- Output total parameters   : 416618368 (0.42B)
-- Embedding parameters     : 58720256 (0.06B)
-- Non-embedding parameters : 357898112 (0.36B)

Saving model and tokenizer to 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED' folder

Patching 'torch_dtype' in 'Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype

Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
```

## 2. The following datasets were used to create a fine-tuning dataset of ~2.8B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct) (NOTE: `output` field only), formatted just as raw text between the tokenizer's BOS/EOS tags (see the sketch below)
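For illustration only, the `output`-field extraction could look something like the following. This is not the exact preprocessing script used; the `text` key, output path and single-file layout are assumptions based on the `dataset_path` globs shown in the next section, so check [qlora-pipe-lite](https://github.com/jukofyork/qlora-pipe-lite) for the exact dataset schema it expects.

```python
import json
from datasets import load_dataset

# Keep only the `output` field of rombodawg/Everything_Instruct as plain text,
# with no chat template (the BOS/EOS wrapping is assumed to be applied later,
# at tokenization time).
ds = load_dataset("rombodawg/Everything_Instruct", split="train")

records = [{"text": row["output"]} for row in ds if row["output"]]

# Written to the directory referenced by the training config below.
with open("datasets/rombodawg-Everything-Instruct/everything-instruct-outputs.json", "w") as f:
    json.dump(records, f, ensure_ascii=False)
```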
## 3. The model was then trained using [qlora-pipe-lite](https://github.com/jukofyork/qlora-pipe-lite) for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step):

```toml
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================

model_dir = 'models/Mistral-Large-Instruct-2411-DRAFT-0.4B-UNTRAINED'
output_dir = 'finetuned'

# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================

full_fine_tune = true

# =======================
# OPTIMIZER CONFIGURATION
# =======================

lr = 5e-5

# ======================
# TRAINING CONFIGURATION
# ======================

sequence_len = 32768
gradient_accumulation_steps = 10  # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step

# =====================
# DATASET CONFIGURATION
# =====================

drop_tails = true

[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'

[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'

[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
```

I used six `RTX A6000` GPUs spread over three nodes, hence the batch size of `60` (`6 GPUs x 10 gradient accumulation steps = 60`).

![image](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/uARplwAxoskC3XKPYKBeg.png)
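Once trained, the model is used purely as the "assistant" in speculative decoding against the full-size target. For the safetensors version, a minimal sketch with Hugging Face `transformers` assisted generation is shown below; the draft repository id is an assumption (adjust to wherever you downloaded it), and loading the 123B target obviously requires substantial VRAM or quantization. For `llama.cpp`, use the `gguf` files linked above together with its draft-model options instead.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "mistralai/Mistral-Large-Instruct-2411"
draft_id = "jukofyork/Mistral-Large-Instruct-2411-DRAFT-0.4B-v3.0"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(target_id)

# After the vocab transplant both models share Mistral's 32768-token vocabulary,
# which is what allows the 0.4B model to act as the draft/assistant here.
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a haiku about speculative decoding."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(target.device)

# `assistant_model=` switches generate() into assisted (speculative) decoding.
output_ids = target.generate(input_ids, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```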