Kimi-Linear-48B-A3B-Instruct · MXFP4 · 32-Group (MLX)

The moonshotai/Kimi-Linear-48B-A3B-Instruct base model, quantized with mlx-lm 0.28.4 into MXFP4 format with group size 32 (with selective 8-bit gates for MoE stability).
The weights and auxiliary files live in this folder and are compatible with the MLX runtime on Apple Silicon.

Korean | English

Model Summary

  • Architecture: KimiLinear MoE decoder-only transformer (27 layers, hidden size 2304, 32 attention heads, 256 experts, 8 experts/token) as defined in config.json.
  • Context length: Configured for ~1M-token windows via linear attention blocks (effective window depends on runtime memory and the KV budget you set with --max-kv-size).
  • Tokenizer: tiktoken-based BPE (tokenizer_config.json, tiktoken.model) with special tokens defined inside the tokenizer files instead of hard-coded IDs.
  • Chat template: See chat_template.jinja for the multi-turn schema that mirrors the official Kimi tool-call format.
  • License: MIT, matching the upstream moonshotai/Kimi-Linear-48B-A3B-Instruct release.
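
To preview how chat_template.jinja renders a conversation before running the model, the template can be applied through transformers directly. A minimal sketch, assuming transformers is installed and the repository has been downloaded to a local path:

    from transformers import AutoTokenizer

    # tokenization_kimi.py defines a custom tokenizer class, so remote code
    # must be trusted for the files in this folder to be used.
    tok = AutoTokenizer.from_pretrained(
        "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
        trust_remote_code=True,
    )

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the Kimi Linear architecture."},
    ]

    # tokenize=False returns the rendered prompt string so the role markers
    # and generation prompt added by chat_template.jinja can be inspected.
    print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))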

Quantization Details

  • Tooling: mlx-lm (>=0.28.4) via python3 -m mlx_lm.convert -q targeting MXFP4 weights.
  • Format: MXFP4 4-bit packing with group size 32 across major linear layers (the MXFP4 mode requires this group size).
  • Exceptions: Mixture-of-experts gate projections remain at 8-bit / group size 64 for routing stability, as recorded in quantization_config.
  • Shard layout: 5× model-0000n-of-00005.safetensors plus model.safetensors.index.json for streaming loads.
  • Memory: Expect roughly 26–29 GB of Apple Silicon unified memory for the weights; KV cache usage scales with context length.

Quantization snippet (abbreviated from config.json):

"quantization_config": {
  "group_size": 32,
  "bits": 4,
  "mode": "mxfp4",
  "model.layers.1.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.2.mlp.gate": {"group_size": 64, "bits": 8},
  "model.layers.3.mlp.gate": {"group_size": 64, "bits": 8},
  "...": "additional gate entries continue through layer 26"
}
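
To confirm these settings (including the 8-bit gate exceptions) against a local copy of the repository, a quick sketch that reads config.json with the standard library only:

    import json
    from pathlib import Path

    repo = Path("/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX")
    qcfg = json.loads((repo / "config.json").read_text())["quantization_config"]

    print("mode:", qcfg.get("mode"), "| bits:", qcfg.get("bits"), "| group_size:", qcfg.get("group_size"))

    # Per-layer overrides are stored as nested dicts; these are the MoE gate
    # projections kept at 8-bit / group size 64.
    for name, params in sorted(qcfg.items()):
        if isinstance(params, dict):
            print(f"{name}: bits={params['bits']}, group_size={params['group_size']}")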

Files Included

  • config.json, generation_config.json, configuration_kimi.py: Hugging Face config + custom MLX config class.
  • model-0000*-of-00005.safetensors, model.safetensors.index.json: quantized MXFP4 shards and shard index.
  • modeling_kimi.py: custom model definition (inherits KimiLinearForCausalLM).
  • tokenizer_config.json, special_tokens_map.json, tiktoken.model, tokenization_kimi.py: tokenizer assets.
  • chat_template.jinja: HF chat template for AutoTokenizer.apply_chat_template.
  • README.md, README-kr.md: English/Korean model cards.

Intended Use & Limitations

  • Use cases: Multilingual assistant/chat, tool invocation, long-context retrieval-augmented generation on Apple Silicon hardware.
  • Not for: Decisions requiring guarantees (medical, legal, financial) without human review; handling unfiltered harmful instructions.
  • Safety: Mirrors the base model’s safety profile—apply additional filtering and RLHF layers if deploying to end users.
  • Security: modeling_kimi.py defines custom modules, so pass --trust-remote-code to the MLX CLIs and enable trust_remote_code when loading through the Python API (see How to Use below). Handle sensitive data in an offline, isolated environment.

How to Use (MLX)

  1. Install MLX tooling (macOS 13.6+ on Apple Silicon):

    pip install -U mlx-lm  # or: pip install -U "git+https://github.com/ml-explore/mlx-lm.git@main"
    # Offline cache only:
    # HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 ...
    
  2. Chat CLI (template + stop rules auto-applied):

    mlx_lm.chat \
      --model /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX \
      --trust-remote-code \
      --max-tokens 512 --temp 0.7 --top-p 0.9
    

    For ≥256K contexts, consider --max-kv-size 262144 (and scale further as needed).

  3. Programmatic usage:

    from mlx_lm import load, generate
    from mlx_lm.sample_utils import make_sampler, make_logits_processors

    model, tok = load(
        "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
        tokenizer_config={"trust_remote_code": True},
    )

    messages = [{"role": "user", "content": "Briefly summarize the Kimi Linear architecture."}]
    prompt = tok.apply_chat_template(messages, add_generation_prompt=True)

    sampler = make_sampler(temp=0.7, top_p=0.9)
    procs = make_logits_processors(repetition_penalty=1.1, repetition_context_size=64)

    print(
        generate(
            model,
            tok,
            prompt,
            max_tokens=512,
            sampler=sampler,
            logits_processors=procs,
        )
    )
    

    (Set HF_HUB_OFFLINE=1 and/or TRANSFORMERS_OFFLINE=1 before running if you need hub-less operation.)
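
For long responses you can stream tokens as they are produced instead of waiting for the full completion. A minimal sketch reusing model, tok, prompt, sampler, and procs from the snippet above, assuming the stream_generate helper exported by mlx_lm (response attribute names can shift between mlx-lm versions):

    from mlx_lm import stream_generate

    # Each yielded response carries the newly decoded text segment.
    for response in stream_generate(
        model,
        tok,
        prompt,
        max_tokens=512,
        sampler=sampler,
        logits_processors=procs,
    ):
        print(response.text, end="", flush=True)
    print()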

Conversion Notes

  • Source checkpoint: moonshotai/Kimi-Linear-48B-A3B-Instruct (synced 2025-11-07 UTC).
  • Conversion command actually executed:
    python3 -m mlx_lm.convert \
      --hf-path moonshotai/Kimi-Linear-48B-A3B-Instruct \
      -q --q-bits 4 --q-group-size 32 --q-mode mxfp4 \
      --mlx-path Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX
    
  • Post-quantization integrity verified via mlx_lm.chat sanity prompts and safetensors checksum inspection.
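
The same conversion can also be scripted through the Python API. A sketch of the equivalent call, assuming the convert helper exported by mlx_lm; keyword names mirror the CLI flags and should be checked against your installed version (in particular the MXFP4 mode selector):

    from mlx_lm import convert

    # Assumed equivalent of the CLI invocation above; verify parameter names
    # with `python3 -m mlx_lm.convert --help` before relying on them.
    convert(
        hf_path="moonshotai/Kimi-Linear-48B-A3B-Instruct",
        mlx_path="Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
        quantize=True,
        q_bits=4,
        q_group_size=32,
        q_mode="mxfp4",
    )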

Integrity & Verification

After upload, verify shard integrity locally:

    cd /path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX
    shasum -a 256 model-*.safetensors > SHA256SUMS
    shasum -a 256 -c SHA256SUMS
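
If you prefer to script the check (for example in CI), a minimal Python equivalent that hashes each shard without loading it fully into memory:

    import hashlib
    from pathlib import Path

    repo = Path("/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX")

    for shard in sorted(repo.glob("model-*.safetensors")):
        digest = hashlib.sha256()
        with shard.open("rb") as f:
            # Hash in 1 MiB chunks so multi-GB shards stay memory-friendly.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        print(f"{digest.hexdigest()}  {shard.name}")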

Additional Tips

  • Set --trust-remote-code whenever loading this repository through MLX CLIs to ensure the custom layers in modeling_kimi.py are registered.
  • Leverage prompt caching (mlx_lm.cache_prompt) plus --max-kv-size for reliable ≥1M-token experiments without exhausting unified memory.
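
In the Python API the same idea is exposed as a reusable prompt cache. A minimal sketch, assuming make_prompt_cache from mlx_lm.models.cache and the prompt_cache keyword accepted by generate in recent mlx-lm releases (and that the custom Kimi layers work with the standard cache helpers):

    from mlx_lm import load, generate
    from mlx_lm.models.cache import make_prompt_cache

    model, tok = load(
        "/path/to/Kimi-Linear-48B-A3B-Instruct-MXFP4-GS32-MLX",
        tokenizer_config={"trust_remote_code": True},
    )

    # Cap the KV cache so very long contexts do not exhaust unified memory.
    prompt_cache = make_prompt_cache(model, max_kv_size=262144)

    # The cache carries conversation state across calls, so earlier turns are
    # not re-processed on each follow-up question.
    for question in ["Summarize this model card.", "Now list its key caveats."]:
        msgs = [{"role": "user", "content": question}]
        prompt = tok.apply_chat_template(msgs, add_generation_prompt=True)
        print(generate(model, tok, prompt, max_tokens=256, prompt_cache=prompt_cache))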

Acknowledgments

  • Moonshot AI — Thank you for releasing the Kimi family and openly documenting the Kimi Linear architecture. Reference: Moonshot AI GitHub, Kimi Linear repo, and the official technical report.
  • Apple Machine Learning Research — Deep gratitude for continuously evolving MLX / MLX-LM so Apple Silicon users can keep learning and iterating. See MLX and MLX-LM.
  • MLX Community on Hugging Face — Thanks for sharing MLX-ready weights and examples at lightning speed; they directly inspired this conversion flow. See mlx-community.

Citation

If you use this model, please cite both Moonshot AI and this quantized release in your research or product documentation.
