robbiemu committed
Commit c45463c · verified · 1 Parent(s): 1f30715

fixed some formatting and added mlx-lm examples

Files changed (1)
  1. README.md +13 -6
README.md CHANGED
@@ -319,7 +319,6 @@ Details
 - The loader maps HF weight names to MLX module names and detects the MLP variant from weight keys to ensure correct layer wiring.
 - Attention uses standard `1/sqrt(d)` scaling for best generation quality.
 
- ```markdown
 ## Installation
 
 This project uses `uv` for dependency management.
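As a rough illustration of the two loader behaviors noted at the top of the hunk above (MLP-variant detection from weight keys and `1/sqrt(d)` attention scaling), here is a minimal sketch; the weight-key names and the heuristic are assumptions for illustration, not the repo's actual loader code.

```python
import math

def detect_mlp_variant(weight_keys):
    """Guess the MLP variant from checkpoint key names (illustrative heuristic only)."""
    # A gated (SwiGLU-style) MLP usually ships gate_proj/up_proj/down_proj weights,
    # while a plain MLP has only up/down projections. The key names are assumed here.
    if any(".mlp.gate_proj." in key for key in weight_keys):
        return "gated"
    return "plain"

def attn_scale(head_dim: int) -> float:
    """Standard 1/sqrt(d) attention scaling, as noted above."""
    return 1.0 / math.sqrt(head_dim)

# detect_mlp_variant(["model.layers.0.mlp.gate_proj.weight"])  # -> "gated"
```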
@@ -335,7 +334,6 @@ uv sync
 
 # 3. (Optional) Add the torch group if you plan to customize/train models
 uv sync --extra torch
- ```
 
 ### Without uv
 If you prefer pip/venv, a `requirements.txt` is provided:
@@ -346,7 +344,6 @@ pip install -r requirements.txt
 ```
 
 > The `torch` extra is only required if you intend to fine-tune or swap model back-ends; the default installation already supports inference.
- ```
 
 ## MLX Inference Examples (safetensors)
 
@@ -377,7 +374,7 @@ This runtime mirrors the functional details of the released weights so they load
 - Map HF names to MLX names during load: `model.embed_tokens`→`tok_embeddings`, layer/attn/norm renames, `mlp.`→`feed_forward.`, `model.norm`→`norm`.
 
 - Template and decoding
- - Provide a Jinja chat template for parity with HF chat usage, but allow `--disable-chat-template` for raw prompting. Multiple EOS IDs are supported.
+ - A Jinja chat template is provided for parity with HF chat usage, and `--disable-chat-template` allows raw prompting. Multiple EOS IDs are supported.
 - Sampling: temperature, top‑p, and greedy; optional repetition/frequency penalties; math helpers `--final-only/--stop-at-boxed/--extract-boxed` to keep answers concise.
 
 # Model Details
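A minimal sketch of the HF→MLX name mapping listed in the hunk above; the `model.embed_tokens`, `mlp.`, and `model.norm` renames come from the README, while the layer/attention/norm renames shown here are assumed placeholders rather than the repo's actual mapping.

```python
def rename_hf_key(hf_key: str) -> str:
    """Map a Hugging Face weight name to the MLX module naming described above."""
    renames = [
        ("model.embed_tokens", "tok_embeddings"),   # from the README
        ("model.norm", "norm"),                     # from the README
        ("mlp.", "feed_forward."),                  # from the README
        ("model.layers.", "layers."),               # assumed layer rename
        ("self_attn.", "attention."),               # assumed attention rename
        ("input_layernorm", "attention_norm"),      # assumed norm rename
        ("post_attention_layernorm", "ffn_norm"),   # assumed norm rename
    ]
    key = hf_key
    for old, new in renames:
        key = key.replace(old, new)
    return key

# e.g. "model.layers.0.mlp.down_proj.weight" -> "layers.0.feed_forward.down_proj.weight"
```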
@@ -436,7 +433,7 @@ Compared to existing fully open-source models, MobileLLM-R1 950M model achieves
 # How to use
 
 To load the pretrained model for further finetuning or evaluation:
- ```bash
+ ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
 model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")
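The loading snippet in the hunk above stops before generation; one self-contained way to continue with the standard transformers API could look like this (the prompt and generation settings are arbitrary examples).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-R1-950M")
model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-R1-950M")

# Arbitrary example prompt; greedy decoding with a modest token budget.
inputs = tokenizer("What is the nearest prime to 9^2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```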
@@ -467,7 +464,17 @@ Flags in `inference.py`
 
 See also: the “MLX Runtime (Apple silicon) — Added Files & Usage” section above for more examples and notes.
 
- Transformers
+ ## Inference (MLX-LM)
+
+ Two mlx-lm models are also provided: a conversion and a dynamic 4-bit quantization. Code to reproduce them and a handy inference runtime are provided in `custom_mlx_lm/`. After installation the following examples should work (you may need to first copy the model into `mlx_lm/` as `llama4_text.py`):
+
+ ```bash
+ mobilellm-infer --model-path MobileLLM-R1-950M-mixed-4bit-mlx --prompt "What is the nearest prime to 9^2?"
+
+ mobilellm-infer --model-path MobileLLM-R1-950M-mlx/ --prompt "What is the nearest prime to 9^2?"
+ ```
+
+ ## Transformers
 
 ```py
 from transformers import pipeline