πŸ’¬ Chat Model "Zero" (Phi-2 2.7B + QLoRA Adapter)

This repository contains the QLoRA adapter for creating "Zero", a specialized instruction-following AI assistant fine-tuned from microsoft/phi-2.

This model serves as the core component of a full-stack AI engineering and MLOps workflow project, covering the complete lifecycle from fine-tuning (with W&B tracking) to local inference and system integration using FastAPI.


πŸš€ Project Overview

Zero is designed as a lightweight, memory-efficient conversational model optimized for reasoning, instruction-following, and question-answering tasks.

Key Features:

  • 🧠 Fine-tuned using QLoRA β€” efficient, low-resource adaptation of Phi-2
  • βš™οΈ Backend: Asynchronous FastAPI inference server with streaming responses
  • πŸ’¬ Frontend: Interactive chat interface built with HTML, TailwindCSS, and JavaScript (via Server-Sent Events)
  • πŸ” Experiment tracking: Integrated Weights & Biases (W&B) logging for training runs
  • πŸ” Local deployment-ready: Lightweight, easily containerized for offline use

🧩 Training Details

| Component  | Description                                |
|------------|--------------------------------------------|
| Base Model | microsoft/phi-2                            |
| Method     | QLoRA (Quantized LoRA Fine-Tuning)         |
| Language   | English only                               |
| Precision  | 4-bit (NF4)                                |
| Optimizer  | Paged AdamW 8-bit                          |
| Frameworks | transformers, peft, bitsandbytes, fastapi  |
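For reference, a QLoRA setup matching the table above can be assembled with peft and bitsandbytes. This is a minimal sketch: the LoRA hyperparameters (r, lora_alpha, dropout) and target modules are illustrative assumptions, not the exact values used for this adapter.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 precision, as listed above
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)  # prepare the quantized model for training

lora_config = LoraConfig(
    r=16,                                    # illustrative rank, not the actual value
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # Phi-2 attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA weights are trainable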

Dataset Composition

The adapter was trained on a curated blend of English datasets (a blending sketch follows the list):

  • alpaca_cleaned β†’ general-purpose instruction-following
  • squad_v2 β†’ question answering and reading comprehension
  • custom_persona (283 samples) β†’ gives Zero its distinct assistant identity
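A blend like this can be assembled with the πŸ€— datasets library. The sketch below is illustrative only: the Hub ID yahma/alpaca-cleaned, the text formatting, and the file path for the persona data are assumptions, not the exact training recipe.

from datasets import load_dataset, concatenate_datasets

def alpaca_to_text(ex):
    # Fold instruction (+ optional input) and output into one training string.
    prompt = ex["instruction"] + (("\n" + ex["input"]) if ex["input"] else "")
    return {"text": prompt + "\n" + ex["output"]}

def squad_to_text(ex):
    # Mirror the CONTEXT:---...---QUESTION: convention shown in the RAG example below.
    answer = ex["answers"]["text"][0] if ex["answers"]["text"] else "unanswerable"
    return {"text": f"CONTEXT:---{ex['context']}---QUESTION:{ex['question']}\n{answer}"}

alpaca = load_dataset("yahma/alpaca-cleaned", split="train")
alpaca = alpaca.map(alpaca_to_text, remove_columns=alpaca.column_names)

squad = load_dataset("squad_v2", split="train")
squad = squad.map(squad_to_text, remove_columns=squad.column_names)

# The 283 custom_persona samples would be loaded from local JSON and mapped
# the same way before being added to the blend.
blend = concatenate_datasets([alpaca, squad]).shuffle(seed=42)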

πŸ–₯️ Training Hardware

Fine-tuning was performed entirely on a consumer-grade laptop:

  • Laptop: Acer Nitro V15
  • GPU: NVIDIA RTX 2050 Mobile (4 GB VRAM)
  • CPU: Intel Core i5-13420H
  • RAM: 16 GB
  • Quantization: 4-bit NF4
  • Strategy: Low-VRAM setup using gradient accumulation, packing, and LoRA adapters (sketched below)
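To make the strategy concrete, a low-VRAM training configuration could look like the following. All values are assumptions for a 4 GB card, not the exact settings of this run.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./zero-qlora",
    per_device_train_batch_size=1,      # tiny micro-batch that fits in 4 GB VRAM
    gradient_accumulation_steps=16,     # effective batch size of 16
    gradient_checkpointing=True,        # trade recompute time for memory
    optim="paged_adamw_8bit",           # Paged AdamW 8-bit, as listed above
    fp16=True,
    learning_rate=2e-4,
    num_train_epochs=2,
    logging_steps=25,
    report_to="wandb",                  # W&B experiment tracking
)

Packing (concatenating short samples up to the full context length) is typically handled by the trainer itself, for example via the packing option of trl's SFTTrainer.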

This demonstrates that Phi-2 can be fine-tuned effectively even on low-VRAM devices.


πŸ”§ Integration Example

A complete local deployment example (FastAPI backend + chat frontend) is available at the main project repository: πŸ‘‰ GitHub – ZeroChat

This repository demonstrates how to integrate this adapter with:

  • πŸ”Ή A FastAPI inference server (supports streaming responses; see the sketch after this list)
  • πŸ”Ή A lightweight HTML/Tailwind chat UI
  • πŸ”Ή Simple local setup and environment configuration for experimentation or portfolio demonstration
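To give a flavor of the backend, here is a minimal sketch of a streaming endpoint. The route and parameters are illustrative, not the actual ZeroChat API, and model and tokenizer are assumed to be loaded as in the "How to Use" section below.

from threading import Thread

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import TextIteratorStreamer

app = FastAPI()

@app.get("/chat")
def chat(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # generate() runs in a background thread while the streamer yields text chunks.
    Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512),
    ).start()

    def sse_events():
        for chunk in streamer:
            yield f"data: {chunk}\n\n"   # Server-Sent Events framing

    return StreamingResponse(sse_events(), media_type="text/event-stream")

The HTML/Tailwind frontend can consume this stream with the browser's built-in EventSource.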

πŸ“ˆ Training Phases Summary

The fine-tuning consisted of multiple experimental stages:

Stage 1:

| Phase | Summary | Runtime |
|-------|---------|---------|
| 1A | Initial fine-tune (canceled due to overfitting) | 11h 50m |
| 1B | Full 2-epoch fine-tune on Alpaca + SQuADv2 + persona | 5d 11h 50m |
| 1C | Small re-train (underfit) | 19h |
| 1D / 1D-A / 1E | Refinement attempts with packing & oversampling | ~3d total |
| 1F | Final adapter re-train from 1B (expanded persona dataset, balanced oversampling) | 1d 5h |

Stage 2:

After gathering the insights from the initial experiments (1A-1F), fine-tuning was restarted completely from scratch. By applying all the lessons learned, this new training run achieved better and more balanced performance in just 1d 21h. The adapter released in this repository is the result of this final, optimized training.

| Phase | Summary | Runtime |
|-------|---------|---------|
| 1 | Fine-tune again from scratch (from the base model), applying all insights from the previous experiments. | 1d 21h |

πŸ“Š W&B Log (Phase 1F): wandb.ai/VoidNova/.../

πŸ“Š W&B Log (Final): wandb.ai/VoidNova/.../runs/rx5fih5v


🧠 How to Use

⚠️ This is a LoRA adapter, not a full model.
You must load the base model (microsoft/phi-2) and apply this adapter on top of it.
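The example below assumes the transformers, peft, bitsandbytes, and accelerate packages are installed (accelerate is required for device_map="auto").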

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

adapter_path = "adhafajp/phi2-qlora-zero-chat"
base_model_path = "microsoft/phi-2"

# Quantization configuration (must match the 4-bit NF4 setup used for training)
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

print(f"Loading base model from: {base_model_path}")
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

print(f"Loading tokenizer from: {adapter_path}")
tokenizer = AutoTokenizer.from_pretrained(
    adapter_path,
    trust_remote_code=True
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Resize embeddings in case the adapter's tokenizer added special tokens
# (e.g. the <|im_start|> / <|im_end|> markers used in the prompt format below).
base_model.resize_token_embeddings(len(tokenizer))

print(f"Applying QLoRA adapter from: {adapter_path}...")
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

print("Model is ready to use!")

# --- INFERENCE EXAMPLE ---

DEFAULT_SYSTEM = "You are Zero, a helpful assistant."

# ChatML-style prompt template the adapter expects: system / user / assistant
# turns delimited by <|im_start|> ... <|im_end|>.
PROMPT_FORMAT = """<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""

instruction = "What is QLoRA and how does it work?"
prompt_text = PROMPT_FORMAT.format(
    system_prompt=DEFAULT_SYSTEM,
    instruction=instruction
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
prompt_token_count = inputs["input_ids"].shape[1]

print(f"\nGenerating response for: '{instruction}'")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=768,
        repetition_penalty=1.1,
        do_sample=False,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>"),
        pad_token_id=tokenizer.pad_token_id,
    )

generated_tokens = outputs[0][prompt_token_count:]
generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=False)

cut_index = len(generated_text)
for stop_token in ["<|endoftext|>", "<|im_end|>"]:
    if stop_token in generated_text:
        cut_index = min(cut_index, generated_text.index(stop_token))

final_answer = generated_text[:cut_index].strip()

print(f"Model response:\n{final_answer}")

πŸͺΆ Example Prompts

"Who are you?" "How to be success?"


🧠 Example with RAG Context

"CONTEXT:---Zinc is an essential mineral perceived by the public today as being of ''exceptional biologic and public health importance'', especially regarding prenatal and postnatal development. Zinc deficiency affects about two billion people in the developing world and is associated with many diseases. In children it causes growth retardation, delayed sexual maturation, infection susceptibility, and diarrhea. Enzymes with a zinc atom in the reactive center are widespread in biochemistry, such as alcohol dehydrogenase in humans. Consumption of excess zinc can cause ataxia, lethargy and copper deficiency.---QUESTION:How many people are affected by zinc deficiency?"

Acknowledgements & License

This project builds upon several outstanding open-source contributions:

  • Base Model: This adapter is fine-tuned from microsoft/phi-2, licensed under the MIT License.
    Copyright (c) 2023 Microsoft.

  • Libraries: Powered by transformers, peft, and bitsandbytes from Hugging Face πŸ€—, as well as torch from PyTorch β€” all permissively licensed (Apache 2.0 or MIT).

  • This Adapter & Code: Released under the MIT License.
    You are free to use, modify, and distribute it with proper attribution.
