QueryRefiner-0.5B-v0.1-SFT
Model Description
QueryRefiner-0.5B-v0.1-SFT is a specialized query analysis and rephrasing model fine-tuned from Qwen2.5-0.5B-Instruct for Retrieval-Augmented Generation (RAG) systems. The model has been trained to analyze user queries through a structured, ontology-inspired reasoning framework that produces XML outputs with domain classification, intent detection, conceptual analysis, and optimized query reformulations designed to improve document retrieval precision.
This model is part of a two-model series addressing a critical challenge in RAG systems: converting ambiguous, vague, or suboptimally-phrased user queries into retrieval-friendly formats through systematic decomposition and analysis. By employing a structured reasoning approach that explicitly identifies domains, intents, concepts, and relationships, the model produces higher-quality rephrasings informed by comprehensive query understanding.
Note: This model is currently in an experimental, beta stage. It represents an initial release focused on establishing baseline capabilities. Future iterations may address current limitations, expand domain coverage, or incorporate additional output features based on real-world deployment feedback.
Examples
Unroll to see Example #1
User query:
How can artificial intelligence be utilized to improve emergency response systems in urban environments?
Model output:
<query_analysis>
<domain ambiguous="false">
<candidate confidence="1.0">Technology</candidate>
</domain>
<intent ambiguous="false">
<candidate confidence="1.0">Improving emergency response systems</candidate>
</intent>
<concepts>
<explicit>artificial intelligence, emergency response systems, urban environments</explicit>
<implicit>smart city, predictive analytics, data-driven decision making</implicit>
</concepts>
<relations>
<relation subject="artificial intelligence" predicate="utilized to" object="improve emergency response systems"/>
<relation subject="emergency response systems" predicate="in" object="urban environments"/>
</relations>
<insufficient_context>false</insufficient_context>
<rephrased>true</rephrased>
<rephrased_query>how artificial intelligence can be used to enhance emergency response capabilities in urban areas</rephrased_query>
</query_analysis>
Unroll to see Example #2
User query:
How have its fiscal policies regarding these things impacted its overall financial stability and future outlook, particularly concerning external market fluctuations?
Model output:
<query_analysis>
<domain ambiguous="true">
<candidate confidence="0.6">Economics</candidate>
<candidate confidence="0.4">Finance</candidate>
</domain>
<intent ambiguous="true">
<candidate confidence="0.7">Analysis</candidate>
<candidate confidence="0.3">Comparison</candidate>
</intent>
<concepts>
<explicit>fiscal policies, external market fluctuations, financial stability, future outlook</explicit>
<implicit>[ENTITY], economic factors</implicit>
</concepts>
<relations>
<relation subject="[ENTITY]" predicate="has" object="fiscal policies"/>
<relation subject="[ENTITY]" predicate="impacts" object="financial stability"/>
<relation subject="[ENTITY]" predicate="impacts" object="future outlook"/>
<relation subject="[ENTITY]" predicate="causes" object="external market fluctuations"/>
</relations>
<insufficient_context>false</insufficient_context>
<rephrased>true</rephrased>
<rephrased_query>impact of [ENTITY]'s fiscal policies on financial stability and future outlook, particularly due to external market volatility</rephrased_query>
</query_analysis>
Unroll to see Example #3
User query:
How does she describe her emotional journey and the complexities of her interactions with him in that book?
Model output:
<query_analysis>
<domain ambiguous="true">
<candidate confidence="0.5">Literature</candidate>
<candidate confidence="0.5">Psychology</candidate>
</domain>
<intent ambiguous="true">
<candidate confidence="0.6">Analysis</candidate>
<candidate confidence="0.4">Description</candidate>
</intent>
<concepts>
<explicit>emotional journey, complex interactions</explicit>
<implicit>[PERSON], [BOOK]</implicit>
</concepts>
<relations>
<relation subject="[PERSON]" predicate="describes" object="complex emotions"/>
<relation subject="[PERSON]" predicate="interacts with" object="him"/>
</relations>
<insufficient_context>true</insufficient_context>
<rephrased>true</rephrased>
<rephrased_query>How does [PERSON] describe their emotional journey and the intricacies of their interactions with [PERSON] in the book?</rephrased_query>
</query_analysis>
Intended Use
The model is designed to serve as a query analysis layer in RAG systems, where it processes user inputs before they are sent to retrieval components. It can handle queries with varying degrees of ambiguity and identify when queries lack sufficient context for effective retrieval. The structured XML output enables downstream systems to make informed decisions about how to handle each query based on confidence scores, ambiguity flags, and the rephrased version.
Typical deployment scenarios include conversational search systems, question-answering platforms, document retrieval services, and any application where understanding user intent and optimizing queries for semantic search is critical. The model's compact 0.5B parameter size makes it suitable for deployment in resource-constrained environments or as part of larger multi-component systems where latency is a concern.
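As an illustration of this pattern, the minimal sketch below (assuming the model's XML output is already available as a string, e.g. from the Usage Example further down) parses the analysis with Python's standard library and turns it into a routing decision. `plan_retrieval` is a hypothetical downstream helper, not part of the model's tooling.

```python
import xml.etree.ElementTree as ET

def plan_retrieval(xml_output: str) -> dict:
    """Hypothetical downstream consumer: turn the XML analysis into a routing decision."""
    analysis = ET.fromstring(xml_output)
    domain = analysis.find("domain")
    return {
        # Queries flagged as lacking context can be sent back for clarification
        "needs_clarification": analysis.findtext("insufficient_context") == "true",
        # Ambiguity flag and per-candidate confidence scores for the domain
        "domain_ambiguous": domain.get("ambiguous") == "true",
        "domain_candidates": [
            (c.text, float(c.get("confidence"))) for c in domain.findall("candidate")
        ],
        # The retrieval-optimized query to pass to the retriever
        "retrieval_query": analysis.findtext("rephrased_query"),
    }
```

Applied to the model output in Example #1 above, this would return a single Technology candidate at confidence 1.0 and "how artificial intelligence can be used to enhance emergency response capabilities in urban areas" as the retrieval query.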
Training Data
The model was trained on the krogoldAI/rag-query-analysis dataset, which contains 7,305 high-quality query-analysis pairs. The queries were carefully curated from three sources: rag-datasets/rag-mini-wikipedia, razbit96/Ambiguity-Handling-in-User-Queries, and glaiveai/RAG-v1.
Approximately 20% of the training examples include queries with systematically introduced ambiguity at varying levels to ensure the model can handle realistic user inputs across the ambiguity spectrum.
The training data underwent rigorous quality assurance through a dual evaluation framework. Each example was validated for strict XML schema conformance and semantically evaluated using an LLM-as-a-judge protocol with six quality dimensions. Only examples achieving both perfect structural validity and high semantic quality scores were included in the final dataset, ensuring the model was trained exclusively on gold-standard examples.
Training Procedure
The model underwent full fine-tuning of all parameters in Qwen2.5-0.5B-Instruct (rather than a parameter-efficient method such as LoRA). Training was conducted over three epochs with a per-device batch size of 4 and gradient accumulation over 4 steps, yielding an effective batch size of 16. The learning rate was set to 2e-5 with a weight decay of 0.01 to prevent overfitting. Training was performed on an NVIDIA A100 SXM GPU.
The training code can be found here.
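For orientation, here is a minimal sketch of how the reported hyperparameters map onto Hugging Face `TrainingArguments`. The linked training script is authoritative; the trainer class, scheduler, warmup, and precision settings (bf16 below is an assumption) may differ.

```python
from transformers import TrainingArguments

# Sketch only: the reported hyperparameters expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="queryrefiner-0.5b-v0.1-sft",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,            # effective batch size of 16
    learning_rate=2e-5,
    weight_decay=0.01,
    bf16=True,                                # assumption: mixed precision on the A100
)
```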
Model Capabilities
The model employs a systematic, ontology-inspired analysis framework that decomposes queries into structured XML representations. This analytical approach (which identifies domains, intents, concepts, relations, and ambiguities) aims to enhance the quality of the rephrased query output. For every query, it provides domain classification and intent detection with confidence scores that sum to 1.0, properly handling both unambiguous cases (single candidate with confidence 1.0) and ambiguous cases (multiple candidates with distributed confidence). The model can optionally extract explicit and implicit concepts, identify relations between entities using subject-predicate-object triples, and normalize ambiguous terms when disambiguation would improve retrieval.
The rephrasing capability focuses on retrieval optimization rather than query answering. The model transforms queries by using specific terminology likely to appear in relevant documents, expanding acronyms when contextually appropriate, adding disambiguating context, and making implicit references explicit through placeholder notation such as [PERSON] or [COMPANY]. Importantly, the model has learned to preserve already-optimal queries unchanged, recognizing when rephrasing would not improve retrieval effectiveness.
This structured analytical framework ensures that rephrasings are informed by comprehensive query understanding rather than surface-level transformations, leading to more semantically precise retrieval-optimized queries.
The default output schema is the following:
<query_analysis>
<domain ambiguous="true|false">
<candidate confidence="X.X">...</candidate>
</domain>
<intent ambiguous="true|false">
<candidate confidence="X.X">...</candidate>
</intent>
<!-- Optional sections -->
<concepts>
<explicit>...</explicit>
<implicit>...</implicit>
</concepts>
<relations>
<relation subject="..." predicate="..." object="..."/>
</relations>
<normalized_terms>
<term original="..." normalized="..."/>
</normalized_terms>
<!-- End optional sections -->
<insufficient_context>true|false</insufficient_context>
<rephrased>true|false</rephrased>
<rephrased_query>...</rephrased_query>
</query_analysis>
Limitations and Considerations
As a 0.5B parameter model, QueryRefiner-0.5B-v0.1-SFT prioritizes efficiency and deployability over the capabilities of larger language models. While it performs well on the types of queries represented in its training distribution, performance may degrade on highly specialized domains, multilingual queries, or query types significantly different from the training examples. The model focuses exclusively on English-language queries and has been optimized for the specific XML output format defined in its training.
The model's ambiguity detection and confidence scoring reflect patterns learned from the training data, which includes both natural and synthetically augmented ambiguous queries. While the training process incorporated diverse ambiguity levels, edge cases or novel forms of ambiguity may not be handled with the same reliability as more common patterns. Users should consider the model's confidence scores as informative signals rather than calibrated probabilities.
Usage Example
First, make sure you have the latest version of transformers:
pip install git+https://github.com/huggingface/transformers.git
Define the system prompt below. Since this exact prompt was used during training, we recommend leaving it unchanged for optimal results.
Unroll to see the system prompt
SYSTEM_PROMPT = """You are a query analysis and rephraser for a Retrieval-Augmented Generation (RAG) system.
Your sole task is to **analyze user queries** and output a structured XML document.
You must **not answer the query itself**, only analyze and rephrase it.
## RAG Query Optimization
Effective rephrasing should optimize for document retrieval by:
- Using **specific terminology** and domain vocabulary likely to appear in relevant documents
- **Expanding acronyms** when they add context (but not when the acronym itself is the subject)
- **Adding disambiguating context** without over-constraining the search
- **Making implicit references explicit** using placeholders for missing entities (e.g., [PERSON], [COMPANY])
- **Preserving user intent** while improving retrieval precision
Examples: "How do I reset my password?" → "password reset procedure authentication"
"What's their revenue?" → "What's [COMPANY]'s revenue?"
## Analysis Process
Follow this systematic approach to decompose each query:
1. **Identify the domain**: Determine the subject area or field the query relates to (e.g., banking, healthcare, technology, legal). Consider both explicit domain indicators and contextual clues.
2. **Determine the intent**: Classify what the user is trying to accomplish (e.g., definition lookup, troubleshooting, comparison, how-to guidance, factual question).
3. **Extract key concepts (optional)**: Identify explicit terms mentioned and relevant implicit concepts that would aid in query understanding.
4. **Identify relations (optional)**: Map out relationships between entities using subject-predicate-object triples when meaningful connections exist.
5. **Normalize terms (optional)**: Disambiguate or standardize ambiguous terms when clarification would improve retrieval (e.g., "Apple" → "Apple Inc." vs "apple fruit").
6. **Assess query quality**: Evaluate if the query has sufficient context for retrieval and whether rephrasing would improve it.
7. **Generate rephrased query**: Create a clearer, more specific version optimized for document retrieval, or keep the original if already optimal.
## Technical Rules
1. **Never answer the user's question.** Only analyze and rephrase.
2. Always produce valid XML strictly following the schema below.
3. `<domain>` and `<intent>` are **mandatory** and must contain one or more `<candidate confidence="X.X">...</candidate>` entries:
- Confidence scores must always sum to 1.0
- If unambiguous: **exactly one candidate** with `confidence="1.0"` and `ambiguous="false"`
- If ambiguous: multiple candidates with `ambiguous="true"` and confidence distributed proportionally to plausibility:
- Use uniform distribution only when candidates are genuinely equally likely
- Otherwise, weight confidence toward the more probable interpretation
- Examples:
- "What is Mercury's rotation period?" → Astronomy 0.5, Chemistry 0.5 (equally plausible)
- "Jaguar speed in the wild" → Zoology 0.8, Automotive 0.2 (context favors animal)
4. Confidence values must always have one decimal place (e.g., `0.5`, `1.0`).
5. Only `<concepts>`, `<relations>`, and `<normalized_terms>` are optional. **All other elements are mandatory.**
6. `<insufficient_context>` and `<rephrased>` must each appear **exactly once** and be either `true` or `false`.
7. `<rephrased_query>` must always appear, even if identical to the input.
8. **Output only valid XML.** Do not include any explanations, comments, or text outside the XML structure.
9. All elements must appear in the order specified in the schema:
`<domain> → <intent> → <concepts> → <relations> → <normalized_terms> → <insufficient_context> → <rephrased> → <rephrased_query>`.
## Output Schema
```xml
<query_analysis>
<domain ambiguous="true|false">
<candidate confidence="X.X">...</candidate>
</domain>
<intent ambiguous="true|false">
<candidate confidence="X.X">...</candidate>
</intent>
<!-- Optional sections -->
<concepts>
<explicit>...</explicit>
<implicit>...</implicit>
</concepts>
<relations>
<relation subject="..." predicate="..." object="..."/>
</relations>
<normalized_terms>
<term original="..." normalized="..."/>
</normalized_terms>
<!-- End optional sections -->
<insufficient_context>true|false</insufficient_context>
<rephrased>true|false</rephrased>
<rephrased_query>...</rephrased_query>
</query_analysis>
```"""
Then, use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "krogoldAI/QueryRefiner-0.5B-v0.1-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

user_query = "How do I reset my password?"
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_query}
]

# Build the chat-formatted prompt and tokenize it
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt")

# do_sample=True is needed for temperature to take effect; omit both for greedy decoding
outputs = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=512)

# Decode only the newly generated tokens, i.e. the XML analysis
analysis = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(analysis)
```
Evaluation and Performance
Structural Validation
To assess the model's ability to produce correctly formatted outputs, we evaluated structural conformance across 1,000 examples from the test split of krogoldAI/rag-ambiguous-queries, comparing both QueryRefiner-0.5B-v0.1 models against their base model Qwen2.5-0.5B-Instruct. The evaluation measured adherence to the required XML schema, including tag presence, well-formedness, element ordering, and confidence score formatting.
Table 1 - Structural validity metrics (% of outputs meeting each requirement)
| Metric | Qwen2.5-0.5B-Instruct | QueryRefiner-0.5B-v0.1-SFT | QueryRefiner-0.5B-v0.1-GRPO |
|---|---|---|---|
| Tag structure | 10.8% | 99.9% | 99.9% |
| XML validity | 41.0% | 99.8% | 99.9% |
| Order | 2.0% | 99.9% | 99.9% |
| Confidence | 3.1% | 99.9% | 99.9% |
| Perfectly structured output | 0.0% | 99.8% | 99.9% |
Here, tag structure verifies that all required XML tags are present, XML validity ensures the output is well-formed and parseable, order confirms that required tags appear in the correct sequence, and confidence validates that confidence values are properly formatted and sum to 1.0.
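For concreteness, the following is a rough sketch of how these four checks can be implemented with the standard library; it approximates, rather than reproduces, the actual evaluation script.

```python
import re
import xml.etree.ElementTree as ET

REQUIRED_TAGS = ["domain", "intent", "insufficient_context", "rephrased", "rephrased_query"]

def structural_checks(output: str) -> dict:
    """Approximate versions of the four structural metrics (a sketch, not the eval code)."""
    # Tag structure: every required closing tag appears in the raw output
    results = {"tag_structure": all(f"</{t}>" in output for t in REQUIRED_TAGS)}

    # XML validity: the output parses and is rooted at <query_analysis>
    try:
        root = ET.fromstring(output)
        results["xml_validity"] = root.tag == "query_analysis"
    except ET.ParseError:
        return {**results, "xml_validity": False, "order": False, "confidence": False}

    # Order: required tags appear in the prescribed sequence (optional tags ignored here)
    present = [child.tag for child in root if child.tag in REQUIRED_TAGS]
    results["order"] = present == REQUIRED_TAGS

    # Confidence: one-decimal formatting and a sum of 1.0 for both <domain> and <intent>
    ok = True
    for section in ("domain", "intent"):
        node = root.find(section)
        vals = [c.get("confidence", "") for c in node.findall("candidate")] if node is not None else []
        well_formed = bool(vals) and all(re.fullmatch(r"\d\.\d", v) for v in vals)
        ok &= well_formed and abs(sum(float(v) for v in vals) - 1.0) < 1e-6
    results["confidence"] = bool(ok)
    return results
```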
Semantic Validation
Beyond structural correctness, we evaluated the semantic quality of the model's outputs using an LLM-as-a-judge protocol with GPT-5 on 1,000 examples from the test split of krogoldAI/rag-ambiguous-queries. Each output was assessed across six dimensions aligned with the model's core objectives.
Table 2 - Semantic scores
| Metric | QueryRefiner-0.5B-v0.1-SFT | QueryRefiner-0.5B-v0.1-GRPO |
|---|---|---|
| Domain accuracy | 98.97 ± 8.43% | 98.95 ± 8.46% |
| Intent accuracy | 98.20 ± 8.84% | 98.75 ± 8.05% |
| Ambiguity assessment | 99.04 ± 6.84% | 99.40 ± 5.79% |
| Rephrasing quality | 88.23 ± 18.20% | 90.33 ± 17.61% |
| Intent preservation | 95.09 ± 15.74% | 96.02 ± 14.72% |
| Follows guidelines | 97.07 ± 13.78% | 97.40 ± 12.20% |
| Overall semantic score | 96.10 ± 9.74% | 96.81 ± 9.18% |
All values are reported as mean ± standard deviation (%), computed over test examples. The base model produced too few valid XML samples for meaningful semantic evaluation.
Unroll to see the system prompt used for the "judge" LLM
JUDGE_PROMPT = """You are evaluating query analyses for a RAG system.
### System Requirements
The analyzer was instructed to optimize queries for document retrieval by:
- Using **specific terminology** and domain vocabulary likely to appear in relevant documents
- **Expanding acronyms** when they add context (but not when the acronym itself is the subject)
- **Adding disambiguating context** without over-constraining the search
- **Making implicit references explicit** using placeholders for missing entities (e.g., [PERSON], [COMPANY])
- **Preserving user intent** while improving retrieval precision
- **Keeping the original query unchanged** if it's already well-optimized for retrieval
### Input
Original: "{original}"
Domain: {domain}
Intent: {intent}
Rephrased: "{rephrased}"
Note: The [ambiguous] tag indicates the analyzer determined the query has multiple plausible interpretations for that dimension, with confidence distributed across candidates.
### Evaluation Criteria (1-5 scale)
1. Domain Accuracy (1=wrong, 3=acceptable, 5=perfect)
- Are the domain candidates correct?
- Are confidence scores reasonable?
2. Intent Accuracy (1=wrong, 3=acceptable, 5=perfect)
- Are the intent candidates correct?
- Are confidence scores reasonable?
3. Ambiguity Assessment (1=wrong, 3=acceptable, 5=perfect)
- Is the ambiguity determination appropriate for this query?
- If ambiguous: Is the confidence distribution justified?
- If clearly unambiguous but marked ambiguous (or vice versa), score ≤2.
4. Rephrasing Quality
1 = Poor (significantly degraded the query, or completely failed to address clear issues)
2 = Suboptimal (minor degradation, or missed an obvious improvement opportunity)
3 = Neutral (minor changes with mixed effects)
4 = Good improvement, but could be better
5 = Optimal outcome (either improved a suboptimal query, or correctly preserved an already-optimal one)
(Note: Do not penalize rephrasing for being minimal if the original was already optimal.)
5. Intent Preservation (1=lost, 3=mostly preserved, 5=fully preserved)
- Focus on meaning fidelity, not retrieval optimization.
6. Follows Guidelines (1=violates, 3=mostly follows, 5=perfectly follows)
- Check adherence to the RAG optimization principles above.
### Output Format
{{
"domain_accuracy": <1-5>,
"intent_accuracy": <1-5>,
"ambiguity_assessment": <1-5>,
"rephrasing_quality": <1-5>,
"intent_preservation": <1-5>,
"follows_guidelines": <1-5>,
"critical_issue": "<brief description or null>",
"usable": <true/false> // true if suitable for RAG use, even if not perfect
}}
Output only valid JSON. Do not include any explanations, comments, or text outside the JSON structure.
"""
Performance Considerations
Performance characteristics will vary based on query type, domain, and ambiguity level. The model is expected to perform strongest on queries similar to those in the training distribution and may require additional fine-tuning or prompt engineering for specialized applications or domains underrepresented in the training data.
Acknowledgments
This model builds upon Qwen2.5-0.5B-Instruct by the Qwen team at Alibaba Cloud. The training data incorporates queries from rag-datasets/rag-mini-wikipedia, razbit96/Ambiguity-Handling-in-User-Queries, and glaiveai/RAG-v1.