---
license: mit
base_model: microsoft/Phi-4-mini-instruct
tags:
- openvino
- npu
- intel
- int4
- symmetric
- quantization
- phi-4
language:
- en
pipeline_tag: text-generation
---

# Phi-4-mini-instruct INT4_SYM for Intel NPU

🎉 **First NPU-optimized Phi-4-mini model with correct quantization for the Intel NPU!**

## Model Description

This is microsoft/Phi-4-mini-instruct (3.8B parameters) converted to OpenVINO IR format with **NPU-specific INT4 symmetric quantization**.

### Key Difference from Standard OpenVINO Models

**Critical discovery:** the Intel NPU requires **INT4_SYM** (symmetric, channel-wise) quantization, not the INT4_ASYM (asymmetric, grouped) quantization used by standard OpenVINO pre-converted models.

| Quantization Type | NPU Compatibility |
|-------------------|-------------------|
| INT4_ASYM (group_size=64) | ❌ FAILS (MatMul errors) |
| INT4_SYM (channel-wise) | ✅ WORKS (this model) |

## Quantization Details

- **Method:** INT4_SYM (symmetric)
- **Group size:** -1 (channel-wise, not grouped)
- **Calibration:** AWQ + scale estimation on the wikitext2 dataset
- **Distribution:** 84% INT4_SYM (128 layers), 16% INT8_ASYM (1 layer)
- **Size:** 2.13 GB

## Performance on Intel NPU

**Tested on an Intel Core Ultra 7 155H (NPU driver v32.0.100.4297):**

- **Speed:** 6.8 tok/s
- **Compilation:** 68.5 s
- **Inference:** Stable, production-ready

**Comparison to other models on the same hardware (Intel Core Ultra 7 155H):**

- **Qwen2.5-1.5B-Instruct (INT4_SYM):** 10.7 tok/s (0.87 GB) - baseline performance
- **Phi-4-mini-instruct (INT4_SYM):** 6.8 tok/s (2.13 GB) - roughly 2.5x the parameters, with stronger reasoning
- **Performance ratio:** ~64% of Qwen's speed, but a significantly more capable model

## Usage

### Requirements

```bash
pip install openvino-genai huggingface-hub
```

### Python API

```python
from huggingface_hub import snapshot_download
from openvino_genai import LLMPipeline

# Download the OpenVINO IR files locally, then load the pipeline on the Intel NPU
model_dir = snapshot_download("AhtnaGlen/phi-4-mini-instruct-int4-sym-npu-ov")
pipe = LLMPipeline(model_dir, device="NPU")

# Generate text
response = pipe.generate("Explain quantum computing:", max_new_tokens=100)
print(response)
```

### Streaming

```python
# openvino-genai streams tokens through a callback passed as `streamer`
def stream_token(token):
    print(token, end="", flush=True)

pipe.generate("Write a story:", max_new_tokens=200, streamer=stream_token)
```

## Why This Matters

Standard OpenVINO Phi-4 models (e.g., `OpenVINO/Phi-4-mini-instruct-int4-ov`) use INT4_ASYM quantization, which **fails NPU compilation** with errors like:

```
[ERROR] Channels count of input tensor shape and filter shape must be the same: 0 != 48
```

This model uses the **NPU-optimized quantization recipe** specified in [Intel's NPU documentation](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html):

```bash
# --sym: symmetric quantization (key for NPU)
# --group-size -1: channel-wise, not grouped
optimum-cli export openvino -m microsoft/Phi-4-mini-instruct \
  --weight-format int4 --sym --group-size -1 \
  --awq --scale-estimation --dataset wikitext2 \
  phi-4-mini-instruct-int4-sym-npu-ov  # output directory
```
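For reference, the same recipe can also be expressed through optimum-intel's Python API. The sketch below is an untested outline, assuming a recent `optimum[openvino]` release whose `OVWeightQuantizationConfig` exposes `sym`, `group_size`, `quant_method`, `scale_estimation`, and `dataset`; the output directory name is illustrative.

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Mirror the CLI flags: 4-bit symmetric, channel-wise (group_size=-1),
# with AWQ and scale estimation calibrated on wikitext2.
quant_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=-1,
    quant_method="awq",
    scale_estimation=True,
    dataset="wikitext2",
)

model = OVModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",
    export=True,
    quantization_config=quant_config,
)
model.save_pretrained("phi-4-mini-instruct-int4-sym-npu-ov")  # illustrative output path
```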
## Model Capabilities

- **Instruction following:** Fine-tuned for chat/instruction tasks
- **Reasoning:** Enhanced reasoning capabilities (Phi-4 series)
- **Context length:** 4096 tokens
- **NPU acceleration:** Full hardware offload to Intel NPU

## Hardware Requirements

- **Intel NPU:** Core Ultra 7 155H (tested), or other NPU 3720/4000 series
- **Driver:** v32.0.100.4297 or newer
- **OpenVINO:** 2025.3.0 or newer
- **Memory:** ~3 GB for model + inference

## Limitations

- **NPU only:** This model is quantized specifically for the Intel NPU
- **Speed trade-off:** 6.8 tok/s vs Qwen2.5-1.5B @ 10.7 tok/s on Intel Core Ultra 7 155H
- **Size vs capability:** Larger model (2.13 GB) but enhanced reasoning and instruction-following
- **Hardware specific:** Performance validated on Intel Core Ultra 7 155H NPU

## Citation

If you use this model, please cite:

```bibtex
@misc{phi4-mini-npu-optimized,
  title={Phi-4-mini-instruct INT4_SYM for Intel NPU},
  author={OpenVINO Community},
  year={2025},
  howpublished={\url{https://huggingface.co/AhtnaGlen/phi-4-mini-instruct-int4-sym-npu-ov}},
}
```

## Acknowledgments

- **Base model:** Microsoft Phi-4-mini-instruct
- **Framework:** Intel OpenVINO
- **Quantization:** NNCF (Neural Network Compression Framework)
- **Discovery:** Community finding on NPU quantization requirements

## License

MIT (following the base model license)

## Model Card Contact

For issues or questions about NPU compatibility, please open an issue on the model repository.

---

**Note:** This model demonstrates the importance of quantization method selection for hardware-specific optimization. Always verify quantization parameters match target hardware requirements!
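In that spirit, here is a minimal pre-flight sketch (using only the standard OpenVINO runtime API; `FULL_DEVICE_NAME` is a generic OpenVINO device property) to confirm the NPU is visible before loading the pipeline:

```python
import openvino as ov

core = ov.Core()
if "NPU" not in core.available_devices:
    raise SystemExit("No Intel NPU detected - check that the NPU driver is installed.")

# Print the device name reported by the NPU plugin
print("NPU device:", core.get_property("NPU", "FULL_DEVICE_NAME"))
```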