# Kimi-Linear-48B-A3B-Instruct · MXFP4 · 32-Group (MLX)

This repository contains the `moonshotai/Kimi-Linear-48B-A3B-Instruct` base model quantized with `mlx-lm` 0.28.4 into the MXFP4 format at group size 32 (with selective 8-bit gates for MoE routing stability). The weights and auxiliary files live in this folder and are compatible with the MLX runtime on Apple Silicon.
## Model Summary

- Architecture: Kimi Linear MoE decoder-only transformer (27 layers, hidden size 2304, 32 attention heads, 256 experts, 8 experts/token) as defined in `config.json`.
- Context length: Configured for ~1M-token windows via linear attention blocks (the effective window depends on runtime memory and the KV budget you set with `--max-kv-size`).
- Tokenizer: tiktoken-based BPE (`tokenizer_config.json`, `tiktoken.model`) with special tokens defined inside the tokenizer files instead of hard-coded IDs. A template-rendering example follows this list.
- Chat template: See `chat_template.jinja` for the multi-turn schema that mirrors the official Kimi tool-call format.
- License: MIT, matching the upstream `moonshotai/Kimi-Linear-48B-A3B-Instruct` release.
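To see exactly what the chat template and special tokens produce before committing to a full model load, you can render a conversation to plain text with the tokenizer alone. This is a minimal sketch, not part of the official release: it assumes `transformers` is installed alongside the MLX tooling, and the local path is a placeholder you substitute yourself.

```python
from transformers import AutoTokenizer

# Tokenizer-only load; trust_remote_code pulls in tokenization_kimi.py.
tok = AutoTokenizer.from_pretrained(
    "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize the Kimi Linear attention design in one sentence."},
]

# Render without tokenizing so you can inspect the special tokens the template inserts.
rendered = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(rendered)
```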
## Quantization Details

- Tooling: `mlx-lm` (>=0.28.4) via `python3 -m mlx_lm.convert -q`, targeting MXFP4 weights.
- Format: MXFP4 4-bit packing with group size 32 across major linear layers (group size 32 is a mode-enforced requirement for MXFP4).
- Exceptions: Mixture-of-experts gate projections remain at 8-bit / group size 64 for routing stability, as recorded in `quantization_config`.
- Shard layout: 5× `model-0000n-of-00005.safetensors` plus `model.safetensors.index.json` for streaming loads.
- Memory: Expect roughly 26–29 GB of Apple Silicon unified memory for the weights; KV cache usage scales with context length. (A quick way to check the on-disk weight footprint is sketched after this list.)
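Summing the shard sizes on disk gives a lower bound for the unified memory needed to hold the weights (KV cache and activations come on top). A minimal sketch, assuming the placeholder path is replaced with your local clone:

```python
from pathlib import Path

# Adjust to your local clone of this repository.
repo = Path("/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX")

# Sum the on-disk size of the five MXFP4 shards.
shards = sorted(repo.glob("model-*-of-00005.safetensors"))
total_bytes = sum(p.stat().st_size for p in shards)

for p in shards:
    print(f"{p.name}: {p.stat().st_size / 2**30:.2f} GiB")
print(f"total weights: {total_bytes / 2**30:.2f} GiB across {len(shards)} shards")
```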
Quantization snippet (abbreviated from `config.json`):

```json
"quantization_config": {
  "group_size": 32,
  "bits": 4,
  "mode": "mxfp4",
  "model.layers.1.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.2.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.3.mlp.gate": {"group_size": 64, "bits": 8},
  "...": "additional gate entries continue through layer 26"
}
```
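To list every per-layer override recorded in the shipped config (rather than the abbreviated excerpt above), you can filter `quantization_config` for entries whose values are nested dicts. A small sketch, again with a placeholder path:

```python
import json
from pathlib import Path

repo = Path("/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX")
config = json.loads((repo / "config.json").read_text())
qcfg = config["quantization_config"]

# Global settings are plain scalars; per-layer overrides are nested dicts.
overrides = {k: v for k, v in qcfg.items() if isinstance(v, dict)}

print(f"global: {qcfg.get('bits')}-bit, group_size={qcfg.get('group_size')}, mode={qcfg.get('mode')}")
for name, override in sorted(overrides.items()):
    print(f"{name}: {override.get('bits')}-bit, group_size={override.get('group_size')}")
```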
## Files Included

| File | Purpose |
|---|---|
| `config.json`, `generation_config.json`, `configuration_kimi.py` | Hugging Face config + custom MLX config class. |
| `model-0000*-of-00005.safetensors`, `model.safetensors.index.json` | Quantized MXFP4 shards. |
| `modeling_kimi.py` | Custom model definition (inherits `KimiLinearForCausalLM`). |
| `tokenizer_config.json`, `special_tokens_map.json`, `tiktoken.model`, `tokenization_kimi.py` | Tokenizer assets. |
| `chat_template.jinja` | HF chat template for `AutoTokenizer.apply_chat_template`. |
| `README.md`, `README-kr.md` | English/Korean model cards. |
## Intended Use & Limitations
- Use cases: Multilingual assistant/chat, tool invocation, long-context retrieval-augmented generation on Apple Silicon hardware.
- Not for: Decisions requiring guarantees (medical, legal, financial) without human review; handling unfiltered harmful instructions.
- Safety: Mirrors the base model’s safety profile—apply additional filtering and RLHF layers if deploying to end users.
- Security: `modeling_kimi.py` defines custom modules, so you must pass `--trust-remote-code` to the CLI and `trust_remote_code=True` to the Python API. Handle sensitive data in an offline, isolated environment.
## How to Use (MLX)

Install the MLX tooling (macOS 13.6+ on Apple Silicon):

```bash
pip install -U mlx-lm
# or: pip install -U "git+https://github.com/ml-explore/mlx-lm.git@main"
# Offline cache only:
# HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 ...
```

Chat CLI (chat template and stop rules are applied automatically):

```bash
mlx_lm.chat \
  --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
  --trust-remote-code \
  --max-tokens 512 --temperature 0.7 --top-p 0.9
```

For ≥256K contexts, consider `--max-kv-size 262144` (and scale further as needed).

Programmatic usage:

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tok = load(
    "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Briefly summarize the Kimi Linear architecture."}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True)

sampler = make_sampler(temperature=0.7, top_p=0.9)
procs = make_logits_processors(repetition_penalty=1.1, repetition_context_size=64)

print(
    generate(
        model,
        tok,
        prompt,
        max_tokens=512,
        sampler=sampler,
        logits_processors=procs,
    )
)
```

(Set `HF_HUB_OFFLINE=1` and/or `TRANSFORMERS_OFFLINE=1` before running if you need hub-less operation.)
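For interactive applications you may prefer token-by-token streaming rather than waiting for the full completion. The sketch below mirrors the `load` call above and uses `stream_generate`; in recent `mlx-lm` releases it yields response chunks with a `.text` field, but treat that detail as an assumption and check your installed version.

```python
from mlx_lm import load, stream_generate

model, tok = load(
    "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "List three properties of linear attention."}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True)

# Print tokens as they are produced instead of waiting for the full reply.
for chunk in stream_generate(model, tok, prompt, max_tokens=256):
    print(chunk.text, end="", flush=True)
print()
```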
## Conversion Notes

- Source checkpoint: `moonshotai/Kimi-Linear-48B-A3B-Instruct` (synced 2025-11-07 UTC).
- Conversion command (flag names as in `mlx-lm` 0.28.x; other releases may differ):

```bash
python3 -m mlx_lm.convert \
  --hf-path moonshotai/Kimi-Linear-48B-A3B-Instruct \
  -q --q-bits 4 --q-group-size 32 --q-mode mxfp4 \
  --mlx-path Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX
```

- Post-quantization integrity was verified via `mlx_lm.chat` sanity prompts and `safetensors` checksum inspection.
## Integrity & Verification

Generate checksums from the shards before distributing them, then verify any downloaded copy against that list:

```bash
cd /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX

# Record checksums (run once on the reference copy).
shasum -a 256 model-*.safetensors > SHA256SUMS

# Verify a copy against the recorded checksums.
shasum -a 256 -c SHA256SUMS
```
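If you prefer to stay in Python (for example on a machine without `shasum`), the same SHA-256 digests can be computed with the standard-library `hashlib`; a small sketch, assuming the placeholder path is replaced:

```python
import hashlib
from pathlib import Path

repo = Path("/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX")

# Compute SHA-256 digests of the shards (same values `shasum -a 256` prints).
for shard in sorted(repo.glob("model-*.safetensors")):
    h = hashlib.sha256()
    with shard.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    print(f"{h.hexdigest()}  {shard.name}")
```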
## Additional Tips

- Pass `--trust-remote-code` whenever loading this repository through the MLX CLIs so that the custom layers in `modeling_kimi.py` are registered.
- Combine prompt caching (`mlx_lm.cache_prompt`) with `--max-kv-size` for reliable ≥1M-token experiments without exhausting unified memory; a programmatic equivalent is sketched after this list.
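As a programmatic counterpart to the `mlx_lm.cache_prompt` CLI, recent `mlx-lm` releases expose a reusable prompt-cache object. The sketch below assumes `make_prompt_cache` and the `prompt_cache`/`max_kv_size` parameters behave as in current releases, so treat it as a starting point rather than a guaranteed API.

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tok = load(
    "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
    trust_remote_code=True,
)

# Bound the attention KV memory: once max_kv_size tokens are cached, the
# cache rotates instead of growing, keeping memory flat for long prompts.
cache = make_prompt_cache(model, max_kv_size=262144)

long_context = "..."  # e.g. a long retrieved document pasted ahead of the question
messages = [{"role": "user", "content": f"{long_context}\n\nSummarize the key points."}]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tok, prompt, max_tokens=256, prompt_cache=cache))
```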
## Acknowledgments
- Moonshot AI — Thank you for releasing the Kimi family and openly documenting the Kimi Linear architecture. Reference: Moonshot AI GitHub, Kimi Linear repo, and the official technical report.
- Apple Machine Learning Research — Deep gratitude for continuously evolving MLX / MLX-LM so Apple Silicon users can keep learning and iterating. See MLX and MLX-LM.
- MLX Community on Hugging Face — Thanks for sharing MLX-ready weights and examples at lightning speed; they directly inspired this conversion flow. See mlx-community.
## Citation
If you use this model, please cite both Moonshot AI and this quantized release in your research or product documentation.