---
license: mit
base_model: microsoft/Phi-4-mini-instruct
tags:
- openvino
- npu
- intel
- int4
- symmetric
- quantization
- phi-4
language:
- en
pipeline_tag: text-generation
---

# Phi-4-mini-instruct INT4_SYM for Intel NPU

🎉 **First NPU-optimized Phi-4-mini model with correct quantization for the Intel NPU!**

## Model Description

This is microsoft/Phi-4-mini-instruct (3.8B parameters) converted to OpenVINO IR format with **NPU-specific INT4 symmetric quantization**.

### Key Difference from Standard OpenVINO Models

**Critical discovery:** the Intel NPU requires **INT4_SYM** (symmetric, channel-wise) quantization, not the INT4_ASYM (asymmetric, grouped) quantization used by standard OpenVINO pre-converted models.

| Quantization Type | NPU Compatibility |
|-------------------|-------------------|
| INT4_ASYM (group_size=64) | ❌ FAILS (MatMul errors) |
| INT4_SYM (channel-wise) | ✅ WORKS (this model) |

## Quantization Details

- **Method:** INT4_SYM (symmetric)
- **Group size:** -1 (channel-wise, not grouped)
- **Calibration:** AWQ + scale estimation on the wikitext2 dataset
- **Distribution:** 84% INT4_SYM (128 layers), 16% INT8_ASYM (1 layer)
- **Size:** 2.13 GB

## Performance on Intel NPU

**Tested on an Intel Core Ultra 7 155H (NPU driver v32.0.100.4297):**

- **Speed:** 6.8 tok/s
- **Compilation:** 68.5 s
- **Inference:** Stable, production-ready

**Comparison to other models on the same hardware (Intel Core Ultra 7 155H):**

- **Qwen2.5-1.5B-Instruct (INT4_SYM):** 10.7 tok/s (0.87 GB) - baseline performance
- **Phi-4-mini-instruct (INT4_SYM):** 6.8 tok/s (2.13 GB) - roughly 2.5x the parameters, with stronger reasoning
- **Performance ratio:** ~64% of Qwen's speed, but a significantly more capable model

## Usage

### Requirements

```bash
pip install openvino-genai huggingface-hub
```

### Python API

```python
from huggingface_hub import snapshot_download
from openvino_genai import LLMPipeline

# Download the OpenVINO IR files locally, then load the pipeline on the Intel NPU
model_dir = snapshot_download("AhtnaGlen/phi-4-mini-instruct-int4-sym-npu-ov")
pipe = LLMPipeline(model_dir, device="NPU")

# Generate text
response = pipe.generate("Explain quantum computing:", max_new_tokens=100)
print(response)
```

### Streaming

```python
# openvino-genai streams tokens through a callback passed as `streamer`
def stream_token(token):
    print(token, end="", flush=True)

pipe.generate("Write a story:", max_new_tokens=200, streamer=stream_token)
```

## Why This Matters

Standard OpenVINO Phi-4 models (e.g., `OpenVINO/Phi-4-mini-instruct-int4-ov`) use INT4_ASYM quantization, which **fails NPU compilation** with errors like:

```
[ERROR] Channels count of input tensor shape and filter shape must be the same: 0 != 48
```

This model uses the **NPU-optimized quantization recipe** specified in [Intel's NPU documentation](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html):

```bash
# --sym: symmetric quantization (key for NPU)
# --group-size -1: channel-wise, not grouped
optimum-cli export openvino -m microsoft/Phi-4-mini-instruct \
  --weight-format int4 --sym --group-size -1 \
  --awq --scale-estimation --dataset wikitext2 \
  phi-4-mini-instruct-int4-sym-npu-ov  # output directory
```
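For reference, the same recipe can also be expressed through optimum-intel's Python API. The sketch below is an untested outline, assuming a recent `optimum[openvino]` release whose `OVWeightQuantizationConfig` exposes `sym`, `group_size`, `quant_method`, `scale_estimation`, and `dataset`; the output directory name is illustrative.

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Mirror the CLI flags: 4-bit symmetric, channel-wise (group_size=-1),
# with AWQ and scale estimation calibrated on wikitext2.
quant_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=-1,
    quant_method="awq",
    scale_estimation=True,
    dataset="wikitext2",
)

model = OVModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    export=True,
    quantization_config=quant_config,
)
model.save_pretrained("phi-4-mini-instruct-int4-sym-npu-ov")  # illustrative output path
```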
## Model Capabilities

- **Instruction following:** Fine-tuned for chat/instruction tasks
- **Reasoning:** Enhanced reasoning capabilities (Phi-4 series)
- **Context length:** 4096 tokens
- **NPU acceleration:** Full hardware offload to Intel NPU

## Hardware Requirements

- **Intel NPU:** Core Ultra 7 155H (tested), or other NPU 3720/4000 series
- **Driver:** v32.0.100.4297 or newer
- **OpenVINO:** 2025.3.0 or newer
- **Memory:** ~3 GB for model + inference

## Limitations

- **NPU only:** This model is quantized specifically for the Intel NPU
- **Speed trade-off:** 6.8 tok/s vs Qwen2.5-1.5B @ 10.7 tok/s on Intel Core Ultra 7 155H
- **Size vs capability:** Larger model (2.13 GB) but enhanced reasoning and instruction-following
- **Hardware specific:** Performance validated on Intel Core Ultra 7 155H NPU

## Citation

If you use this model, please cite:

```bibtex
@misc{phi4-mini-npu-optimized,
  title={Phi-4-mini-instruct INT4_SYM for Intel NPU},
  author={OpenVINO Community},
  year={2025},
  howpublished={\url{https://huggingface.co/AhtnaGlen/phi-4-mini-instruct-int4-sym-npu-ov}},
}
```

## Acknowledgments

- **Base model:** Microsoft Phi-4-mini-instruct
- **Framework:** Intel OpenVINO
- **Quantization:** NNCF (Neural Network Compression Framework)
- **Discovery:** Community finding on NPU quantization requirements

## License

MIT (following the base model license)

## Model Card Contact

For issues or questions about NPU compatibility, please open an issue on the model repository.

---

**Note:** This model demonstrates the importance of quantization method selection for hardware-specific optimization. Always verify quantization parameters match target hardware requirements!
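In that spirit, here is a minimal pre-flight sketch (using only the standard OpenVINO runtime API; `FULL_DEVICE_NAME` is a generic OpenVINO device property) to confirm the NPU is visible before loading the pipeline:

```python
import openvino as ov

core = ov.Core()
if "NPU" not in core.available_devices:
    raise SystemExit("No Intel NPU detected - check that the NPU driver is installed.")

# Print the device name reported by the NPU plugin
print("NPU device:", core.get_property("NPU", "FULL_DEVICE_NAME"))
```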