QueryRefiner-0.5B-v0.1-SFT-GGUF


This repository provides static GGUF quantized versions of krogoldAI/QueryRefiner-0.5B-v0.1-SFT, offering optimized inference for local or resource-constrained environments while preserving the original model's behavior and alignment. These quantizations are intended for use with compatible runtimes such as llama.cpp, Ollama, or LM Studio. For details about the model's capabilities, training process and intended use cases, please refer to the original model card.

Note: No changes were made to model weights beyond quantization. Generation behavior should remain consistent within the limits of the quantization format.

Provided Quantizations

| Link | Type | Size (MB) | Notes |
|------|------|-----------|-------|
| GGUF | Q4_K_M | 398 | Smallest, minimal footprint with some loss in precision |
| GGUF | Q5_K | 420 | Slightly higher quality, still very lightweight |
| GGUF | Q6_K | 506 | Excellent balance between compactness and fidelity |
| GGUF | Q8_0 | 531 | Recommended: near-lossless quality, still extremely lightweight |
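To fetch a single quant locally with `huggingface_hub`, here is a minimal sketch (the filename follows the naming used in the usage example below; swap it for the quant you want):

```python
from huggingface_hub import hf_hub_download

# Download one quant from this repository (swap the filename for another quant if needed)
model_file = hf_hub_download(
    repo_id="krogoldAI/QueryRefiner-0.5B-v0.1-SFT-GGUF",
    filename="QueryRefiner-0.5B-v0.1-SFT-Q8_0.gguf",
)
print(model_file)  # local path to the downloaded GGUF
```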

Quantization procedure

These quants were generated on Google Colab using the following code snippet:

Unroll to see the generation code
!pip install -q huggingface_hub sentencepiece protobuf accelerate gguf mistral_common

import os
from huggingface_hub import snapshot_download
from google.colab import files
import glob

# Clone llama.cpp and build quantizer
!rm -rf /content/llama.cpp
!git clone https://github.com/ggerganov/llama.cpp
%cd /content/llama.cpp

# Configure + build (includes ggml, llama, and tools)
!cmake -S . -B build
!cmake --build build --config Release -j

# Build the quantization tool specifically (new layout: tools/quantize)
!cmake --build build --target llama-quantize --config Release -j

# Locate quantizer binary dynamically
quant_candidates = glob.glob("/content/llama.cpp/build/**/llama-quantize", recursive=True)
quant_path = quant_candidates[0] if quant_candidates else None

if quant_path:
    print(f"Found quantizer binary: {quant_path}")
else:
    print("Could not locate llama-quantize binary. Check build output:")
    !ls -R /content/llama.cpp/build | head -n 80

# Download the model from Hugging Face
%cd /content
print("\nDownloading model from Hugging Face...")
local_model_path = snapshot_download(
    repo_id="krogoldAI/QueryRefiner-0.5B-v0.1-SFT",
    local_dir="/content/QueryRefiner-0.5B-v0.1-SFT"
)
print(f"Model downloaded to: {local_model_path}")

# Convert HF model to F16 GGUF
print("\nConverting to F16 GGUF...")
!python /content/llama.cpp/convert_hf_to_gguf.py \
    /content/QueryRefiner-0.5B-v0.1-SFT \
    --outfile /content/temp-f16.gguf \
    --outtype f16

# Quantize model (if quantizer exists)
if quant_path and os.path.exists("/content/temp-f16.gguf"):
    print("\nQuantizing to Q4_K_M...")
    !"$quant_path" /content/temp-f16.gguf /content/QueryRefiner-0.5B-v0.1-SFT-Q4_K_M.gguf Q4_K_M
else:
    print("Quantizer binary or F16 file missing.")

# Cleanup & download result
if os.path.exists("/content/QueryRefiner-0.5B-v0.1-SFT-Q4_K_M.gguf"):
    print("\nQuantization complete: downloading file...")
    files.download("/content/QueryRefiner-0.5B-v0.1-SFT-Q4_K_M.gguf")
else:
    print("\nQuantized file not created. Listing GGUF files:")
    !ls -lh /content/*.gguf || true
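The snippet above only produces the Q4_K_M file. The other quants listed in the table can be generated from the same F16 intermediate by calling `llama-quantize` again with a different type argument; a sketch following the same pattern:

```python
# Produce the remaining quant types from the same F16 intermediate
if quant_path and os.path.exists("/content/temp-f16.gguf"):
    for quant_type in ["Q5_K", "Q6_K", "Q8_0"]:
        out_file = f"/content/QueryRefiner-0.5B-v0.1-SFT-{quant_type}.gguf"
        print(f"\nQuantizing to {quant_type}...")
        !"$quant_path" /content/temp-f16.gguf "$out_file" $quant_type
```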

Run with llama-cpp-python

First, make sure you have the latest version of llama-cpp-python:

pip install --upgrade llama-cpp-python

Define the system prompt (it was used as such during training, so for optimal results we recommend keeping it unchanged).

Unroll to see the system prompt
SYSTEM_PROMPT = """You are a query analysis and rephraser for a Retrieval-Augmented Generation (RAG) system.
Your sole task is to **analyze user queries** and output a structured XML document.
You must **not answer the query itself**, only analyze and rephrase it.

## RAG Query Optimization

Effective rephrasing should optimize for document retrieval by:
- Using **specific terminology** and domain vocabulary likely to appear in relevant documents
- **Expanding acronyms** when they add context (but not when the acronym itself is the subject)
- **Adding disambiguating context** without over-constraining the search
- **Making implicit references explicit** using placeholders for missing entities (e.g., [PERSON], [COMPANY])
- **Preserving user intent** while improving retrieval precision

Examples: "How do I reset my password?" β†’ "password reset procedure authentication"
"What's their revenue?" β†’ "What's [COMPANY]'s revenue?"

## Analysis Process

Follow this systematic approach to decompose each query:
1. **Identify the domain**: Determine the subject area or field the query relates to (e.g., banking, healthcare, technology, legal). Consider both explicit domain indicators and contextual clues.
2. **Determine the intent**: Classify what the user is trying to accomplish (e.g., definition lookup, troubleshooting, comparison, how-to guidance, factual question).
3. **Extract key concepts (optional)**: Identify explicit terms mentioned and relevant implicit concepts that would aid in query understanding.
4. **Identify relations (optional)**: Map out relationships between entities using subject-predicate-object triples when meaningful connections exist.
5. **Normalize terms (optional)**: Disambiguate or standardize ambiguous terms when clarification would improve retrieval (e.g., "Apple" → "Apple Inc." vs "apple fruit").
6. **Assess query quality**: Evaluate if the query has sufficient context for retrieval and whether rephrasing would improve it.
7. **Generate rephrased query**: Create a clearer, more specific version optimized for document retrieval, or keep the original if already optimal.

## Technical Rules

1. **Never answer the user's question.** Only analyze and rephrase.
2. Always produce valid XML strictly following the schema below.
3. `<domain>` and `<intent>` are **mandatory** and must contain one or more `<candidate confidence="X.X">...</candidate>` entries:
   - Confidence scores must always sum to 1.0
   - If unambiguous: **exactly one candidate** with `confidence="1.0"` and `ambiguous="false"`
   - If ambiguous: multiple candidates with `ambiguous="true"` and confidence distributed proportionally to plausibility:
     - Use uniform distribution only when candidates are genuinely equally likely
     - Otherwise, weight confidence toward the more probable interpretation
   - Examples:
     - "What is Mercury's rotation period?" β†’ Astronomy 0.5, Chemistry 0.5 (equally plausible)
     - "Jaguar speed in the wild" β†’ Zoology 0.8, Automotive 0.2 (context favors animal)
4. Confidence values must always have one decimal place (e.g., `0.5`, `1.0`).
5. Only `<concepts>`, `<relations>`, and `<normalized_terms>` are optional. **All other elements are mandatory.**
6. `<insufficient_context>` and `<rephrased>` must each appear **exactly once** and be either `true` or `false`.
7. `<rephrased_query>` must always appear, even if identical to the input.
8. **Output only valid XML.** Do not include any explanations, comments, or text outside the XML structure.
9. All elements must appear in the order specified in the schema:
   `<domain> → <intent> → <concepts> → <relations> → <normalized_terms> → <insufficient_context> → <rephrased> → <rephrased_query>`.

## Output Schema

```xml
<query_analysis>
  <domain ambiguous="true|false">
    <candidate confidence="X.X">...</candidate>
  </domain>
  <intent ambiguous="true|false">
    <candidate confidence="X.X">...</candidate>
  </intent>
  <!-- Optional sections -->
  <concepts>
    <explicit>...</explicit>
    <implicit>...</implicit>
  </concepts>
  <relations>
    <relation subject="..." predicate="..." object="..."/>
  </relations>
  <normalized_terms>
    <term original="..." normalized="..."/>
  </normalized_terms>
  <!-- End optional sections -->
  <insufficient_context>true|false</insufficient_context>
  <rephrased>true|false</rephrased>
  <rephrased_query>...</rephrased_query>
</query_analysis>
```"""

Then, you can run the model locally using:

from llama_cpp import Llama

# Path to the locally downloaded GGUF file
model_path = "krogoldAI/QueryRefiner-0.5B-v0.1-SFT-Q8_0.gguf"

# Initialization
llm = Llama(
    model_path=model_path,
    n_threads=4,
    n_ctx=2048,
    verbose=False
)

# User query example
USER_QUERY = "Was his mother tall?"

# Generate an output
output = llm.create_chat_completion(
    temperature=0.7,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_QUERY}
    ]
)

print(output["choices"][0]["message"]["content"])