fixed some formatting and added mlx-lm examples
README.md
CHANGED
@@ -319,7 +319,6 @@ Details
 - The loader maps HF weight names to MLX module names and detects the MLP variant from weight keys to ensure correct layer wiring.
 - Attention uses standard `1/sqrt(d)` scaling for best generation quality.
 
-```markdown
 ## Installation
 
 This project uses `uv` for dependency management.

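For readers skimming this hunk: the weight-name mapping and MLP-variant detection described in the first bullet could be sketched roughly as below. This is an illustrative sketch, not the repository's loader; the `remap_hf_key`/`detect_mlp_variant` names and the per-layer rename targets are assumptions for the example, while the `model.embed_tokens`→`tok_embeddings`, `mlp.`→`feed_forward.`, and `model.norm`→`norm` renames come from the README itself.

```python
# Illustrative sketch only -- not the repository's loader code. The per-layer
# attention/norm target names marked "assumed" are guesses for this example.
def remap_hf_key(hf_key: str) -> str:
    key = hf_key
    key = key.replace("model.embed_tokens", "tok_embeddings")   # from README
    key = key.replace("model.layers.", "layers.")                # assumed layer rename
    key = key.replace("self_attn.", "attention.")                # assumed attn rename
    key = key.replace("input_layernorm", "attention_norm")       # assumed norm rename
    key = key.replace("post_attention_layernorm", "ffn_norm")    # assumed norm rename
    key = key.replace("mlp.", "feed_forward.")                   # from README
    key = key.replace("model.norm", "norm")                      # from README
    return key


def detect_mlp_variant(weight_keys) -> str:
    """Guess the MLP wiring from which projection weights are present (hypothetical check)."""
    return "gated" if any(".gate_proj." in k for k in weight_keys) else "plain"
```
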
@@ -335,7 +334,6 @@ uv sync
 
 # 3. (Optional) Add the torch group if you plan to customize/train models
 uv sync --extra torch
-```
 
 ### Without uv
 If you prefer pip/venv, a `requirements.txt` is provided:

@@ -346,7 +344,6 @@ pip install -r requirements.txt
 ```
 
 > The `torch` extra is only required if you intend to fine-tune or swap model back-ends; the default installation already supports inference.
-```
 
 ## MLX Inference Examples (safetensors)
 
@@ -377,7 +374,7 @@ This runtime mirrors the functional details of the released weights so they load
 - Map HF names to MLX names during load: `model.embed_tokens`→`tok_embeddings`, layer/attn/norm renames, `mlp.`→`feed_forward.`, `model.norm`→`norm`.
 
 - Template and decoding
--
+- The provided Jinja chat template is supported for parity with HF chat usage; `--disable-chat-template` allows raw prompting. Multiple EOS IDs are supported.
 - Sampling: temperature, top‑p, and greedy; optional repetition/frequency penalties; math helpers `--final-only/--stop-at-boxed/--extract-boxed` to keep answers concise.
 
 # Model Details

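The sampling options named in the last bullet (temperature, top-p, greedy) follow the usual recipe. The sketch below is a generic NumPy illustration of that recipe, not the code from `inference.py`, and it omits the repetition/frequency penalties.

```python
import numpy as np

# Minimal temperature + top-p (nucleus) sampling sketch; generic illustration only.
def sample_token(logits: np.ndarray, temperature: float = 0.8, top_p: float = 0.95) -> int:
    if temperature <= 0:
        return int(np.argmax(logits))            # greedy decoding
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())        # stable softmax
    probs /= probs.sum()
    order = np.argsort(-probs)                   # highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1   # smallest nucleus covering top_p
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()       # renormalize over the nucleus
    return int(np.random.choice(keep, p=kept))
```
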
@@ -436,7 +433,7 @@ Compared to existing fully open-source models, MobileLLM-R1 950M model achieves
 # How to use
 
 To load the pretrained model for further finetuning or evaluation:
-```
+```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
 model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")

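As a follow-up to the loading snippet in this hunk, a typical generation call with the standard Transformers API might look like the following; this continuation is a sketch and is not part of the README.

```python
# Continues from the loading snippet above; standard Transformers usage, shown as a sketch.
messages = [{"role": "user", "content": "What is the nearest prime to 9^2?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
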
@@ -467,7 +464,17 @@ Flags in `inference.py`
 
 See also: the “MLX Runtime (Apple silicon) — Added Files & Usage” section above for more examples and notes.
 
-
+## Inference (MLX-LM)
+
+Two mlx-lm models are also provided: a conversion and a dynamic 4-bit quantization. Code to reproduce them and a handy inference runtime are provided in `custom_mlx_lm/`. After installation the following examples should work (you may need to first copy the model file into `mlx_lm/` as `llama4_text.py`).
+
+```bash
+mobilellm-infer --model-path MobileLLM-R1-950M-mixed-4bit-mlx --prompt "What is the nearest prime to 9^2?"
+
+mobilellm-infer --model-path MobileLLM-R1-950M-mlx/ --prompt "What is the nearest prime to 9^2?"
+```
+
+## Transformers
 
 ```py
 from transformers import pipeline
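
The diff cuts off right after the `pipeline` import; for completeness, a typical text-generation pipeline call (standard Transformers usage, not the README's exact snippet) looks like this:

```py
from transformers import pipeline

# Standard Transformers pipeline usage; a sketch, not the README's exact snippet.
generator = pipeline("text-generation", model="facebook/MobileLLM-R1-950M")
result = generator("What is the nearest prime to 9^2?", max_new_tokens=128)
print(result[0]["generated_text"])
```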