robbiemu committed
Commit c45463c · verified · 1 Parent(s): 1f30715

fixed some formatting and added mlx-lm examples

Files changed (1)
  1. README.md +13 -6
README.md CHANGED
@@ -319,7 +319,6 @@ Details
 - The loader maps HF weight names to MLX module names and detects the MLP variant from weight keys to ensure correct layer wiring.
 - Attention uses standard `1/sqrt(d)` scaling for best generation quality.
 
- ```markdown
 ## Installation
 
 This project uses `uv` for dependency management.
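As a rough illustration of the two loader behaviors noted at the top of the hunk above (MLP-variant detection from weight keys and `1/sqrt(d)` attention scaling), here is a minimal sketch; the weight-key names and the heuristic are assumptions for illustration, not the repo's actual loader code.

```python
import math

def detect_mlp_variant(weight_keys):
    """Guess the MLP variant from checkpoint key names (illustrative heuristic only)."""
    # A gated (SwiGLU-style) MLP usually ships gate_proj/up_proj/down_proj weights,
    # while a plain MLP has only up/down projections. The key names are assumed here.
    if any(".mlp.gate_proj." in key for key in weight_keys):
        return "gated"
    return "plain"

def attn_scale(head_dim: int) -> float:
    """Standard 1/sqrt(d) attention scaling, as noted above."""
    return 1.0 / math.sqrt(head_dim)

# detect_mlp_variant(["model.layers.0.mlp.gate_proj.weight"])  # -> "gated"
```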
@@ -335,7 +334,6 @@ uv sync
 
 # 3. (Optional) Add the torch group if you plan to customize/train models
 uv sync --extra torch
- ```
 
 ### Without uv
 If you prefer pip/venv, a `requirements.txt` is provided:
@@ -346,7 +344,6 @@ pip install -r requirements.txt
 ```
 
 > The `torch` extra is only required if you intend to fine-tune or swap model back-ends; the default installation already supports inference.
- ```
 
 ## MLX Inference Examples (safetensors)
 
@@ -377,7 +374,7 @@ This runtime mirrors the functional details of the released weights so they load
 - Map HF names to MLX names during load: `model.embed_tokens`→`tok_embeddings`, layer/attn/norm renames, `mlp.`→`feed_forward.`, `model.norm`→`norm`.
 
 - Template and decoding
- - Provide a Jinja chat template for parity with HF chat usage, but allow `--disable-chat-template` for raw prompting. Multiple EOS IDs are supported.
+ - A Jinja chat template is provided for parity with HF chat usage, and `--disable-chat-template` allows raw prompting. Multiple EOS IDs are supported.
 - Sampling: temperature, top‑p, and greedy; optional repetition/frequency penalties; math helpers `--final-only/--stop-at-boxed/--extract-boxed` to keep answers concise.
 
 # Model Details
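A minimal sketch of the HF→MLX name mapping listed in the hunk above; the `model.embed_tokens`, `mlp.`, and `model.norm` renames come from the README, while the layer/attention/norm renames shown here are assumed placeholders rather than the repo's actual mapping.

```python
def rename_hf_key(hf_key: str) -> str:
    """Map a Hugging Face weight name to the MLX module naming described above."""
    renames = [
        ("model.embed_tokens", "tok_embeddings"),   # from the README
        ("model.norm", "norm"),                     # from the README
        ("mlp.", "feed_forward."),                  # from the README
        ("model.layers.", "layers."),               # assumed layer rename
        ("self_attn.", "attention."),               # assumed attention rename
        ("input_layernorm", "attention_norm"),      # assumed norm rename
        ("post_attention_layernorm", "ffn_norm"),   # assumed norm rename
    ]
    key = hf_key
    for old, new in renames:
        key = key.replace(old, new)
    return key

# e.g. "model.layers.0.mlp.down_proj.weight" -> "layers.0.feed_forward.down_proj.weight"
```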
@@ -436,7 +433,7 @@ Compared to existing fully open-source models, MobileLLM-R1 950M model achieves
 # How to use
 
 To load the pretrained model for further finetuning or evaluation:
- ```bash
+ ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
 model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")
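The loading snippet in the hunk above stops before generation; one self-contained way to continue with the standard transformers API could look like this (the prompt and generation settings are arbitrary examples).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")

# Arbitrary example prompt; greedy decoding with a modest token budget.
inputs = tokenizer("What is the nearest prime to 9^2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```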
@@ -467,7 +464,17 @@ Flags in `inference.py`
 
 See also: the “MLX Runtime (Apple silicon) — Added Files & Usage” section above for more examples and notes.
 
- Transformers
+ ## Inference (MLX-LM)
+
+ Two mlx-lm models are also provided: a conversion and a dynamic 4-bit quantization. Code to reproduce them and a handy inference runtime are provided in `custom_mlx_lm/`. After installation the following examples should work (you may need to first copy the model into `mlx_lm/` as `llama4_text.py`):
+
+ ```bash
+ mobilellm-infer --model-path MobileLLM-R1-950M-mixed-4bit-mlx --prompt "What is the nearest prime to 9^2?"
+
+ mobilellm-infer --model-path MobileLLM-R1-950M-mlx/ --prompt "What is the nearest prime to 9^2?"
+ ```
+
+ ## Transformers
 
 ```py
 from transformers import pipeline