# Backend Issues Analysis & Fixes

## 🔍 **Identified Problems**

### 1. **Streaming Issue - Sending Full Text Instead of Deltas**

**Location**: `app.py` lines 1037-1053

**Problem**:
```python
for output in inference_backend.engine.generate([prompt], sampling_params, use_tqdm=False):
    if output.outputs:
        text = output.outputs[0].text          # ❌ This is the FULL accumulated text
        chunk = {"delta": {"content": text}}   # ❌ Sending the full text as a "delta"
```

**Issue**: vLLM's `generate()` returns the full accumulated text on each iteration, not just the new tokens. We're re-sending the entire response every time, which is why the UI had to implement its own delta-extraction logic.

**Fix**: Track the previously sent text and send only the difference.

---

### 2. **Missing Stop Tokens Configuration**

**Location**: `app.py` lines 1029-1034

**Problem**:
```python
sampling_params = SamplingParams(
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=0.9,
    repetition_penalty=1.05
)
# ❌ NO stop tokens configured!
```

**Issue**: Without proper stop tokens the model doesn't know when to stop and keeps generating, leading to:
- Conversation hallucinations (`User:`, `Assistant:` appearing in the output)
- Stray EOS tokens in the output (e.g. `<|endoftext|>`)
- Responses that don't end cleanly

**Fix**: Add proper stop tokens based on the model type.

---

### 3. **Prompt Format Causing Hallucinations**

**Location**: `app.py` lines 1091-1103

**Problem**:
```python
prompt = ""
for message in messages:
    role = message.get("role", "")
    content = message.get("content", "")
    if role == "system":
        prompt += f"System: {content}\n"
    elif role == "user":
        prompt += f"User: {content}\n"
    elif role == "assistant":
        prompt += f"Assistant: {content}\n"
prompt += "Assistant:"
```

**Issue**: This naive format encourages the model to continue the pattern, causing it to generate:
```
Assistant: [response]
User: [hallucinated]
Assistant: [more hallucination]
```

**Fix**: Use the proper chat template from the model's tokenizer.

---

### 4. **Default max_tokens Too Low**

**Location**: `app.py` line 1088

**Problem**:
```python
max_tokens = request.get("max_tokens", 150)  # ❌ Too restrictive
```

**Issue**: 150 tokens is very limiting for financial explanations; responses get cut off mid-sentence.

**Fix**: Increase the default to 512-1024 tokens.

---

### 5. **No Model-Specific EOS Tokens**

**Location**: Multiple places

**Problem**: Each LinguaCustodia model has different EOS token IDs:
- **llama3.1-8b**: `[128001, 128008, 128009]`
- **qwen3-8b**: `[151645, 151643]`
- **gemma3-12b**: `[1, 106]`

But none of them are passed to vLLM's `SamplingParams`!

**Fix**: Load the EOS tokens from the model config and pass them to vLLM (see the sketch at the end of this section).

---

### 6. **Repetition Penalty Too Low**

**Location**: `app.py` line 1033

**Problem**:
```python
repetition_penalty=1.05  # Too weak to prevent loops
```

**Issue**: Financial models can get stuck in repetitive patterns, and 1.05 is barely noticeable.

**Fix**: Increase to 1.1-1.15 for better repetition prevention.
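Before moving on to the fixes, here is a minimal sketch of how the EOS-token gap from problem 5 could be closed: read `eos_token_id` from the model's generation config and hand the IDs to vLLM via `SamplingParams(stop_token_ids=...)`. This is illustrative only; the `LinguaCustodia/{model_name}` repo path, the `FALLBACK_EOS_IDS` table (copied from the values listed above), and the helper name `get_eos_token_ids` are assumptions, not existing backend code.

```python
# Sketch: resolve model-specific EOS token IDs and pass them to vLLM.
from typing import List

from transformers import GenerationConfig
from vllm import SamplingParams

# Assumed fallback table, mirroring the EOS IDs listed in problem 5 above.
FALLBACK_EOS_IDS = {
    "llama3.1-8b": [128001, 128008, 128009],
    "qwen3-8b": [151645, 151643],
    "gemma3-12b": [1, 106],
}


def get_eos_token_ids(model_name: str) -> List[int]:
    """Read eos_token_id from the model's generation config, with a static fallback."""
    try:
        gen_cfg = GenerationConfig.from_pretrained(f"LinguaCustodia/{model_name}")
        eos = gen_cfg.eos_token_id
        if eos is not None:
            return eos if isinstance(eos, list) else [eos]
    except Exception:
        pass  # offline or missing generation_config.json -> use the static table
    for key, ids in FALLBACK_EOS_IDS.items():
        if key in model_name.lower():
            return ids
    return []


# Usage: generation halts on any of the model-specific EOS token IDs.
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=512,
    top_p=0.9,
    repetition_penalty=1.1,
    stop_token_ids=get_eos_token_ids("qwen3-8b"),
)
```

Token-ID stops complement the string stops added in Priority 3 below: the strings catch plain-text boundaries like `\nUser:`, while `stop_token_ids` halts generation on the tokenizer's actual EOS tokens.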
---

## ✅ **Recommended Fixes**

### Priority 1: Fix Streaming (Critical for UX)

```python
async def stream_chat_completion(prompt: str, model: str, temperature: float, max_tokens: int, request_id: str):
    try:
        from vllm import SamplingParams

        # Get model-specific stop tokens
        stop_tokens = get_stop_tokens_for_model(model)

        sampling_params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=0.9,
            repetition_penalty=1.1,
            stop=stop_tokens  # ✅ Add stop tokens
        )

        previous_text = ""  # ✅ Track what we've already sent

        for output in inference_backend.engine.generate([prompt], sampling_params, use_tqdm=False):
            if output.outputs:
                current_text = output.outputs[0].text

                # ✅ Send only the NEW part
                new_text = current_text[len(previous_text):]

                if new_text:
                    chunk = {
                        "id": request_id,
                        "object": "chat.completion.chunk",
                        "created": int(time.time()),
                        "model": model,
                        "choices": [{
                            "index": 0,
                            "delta": {"content": new_text},  # ✅ True delta
                            "finish_reason": None
                        }]
                    }
                    yield f"data: {json.dumps(chunk)}\n\n"
                    previous_text = current_text
    except Exception as exc:
        # Surface backend errors to the client instead of silently dropping the stream
        error_chunk = {"error": {"message": str(exc)}}
        yield f"data: {json.dumps(error_chunk)}\n\n"
```

### Priority 2: Use Proper Chat Templates

```python
def format_chat_prompt(messages: List[Dict], model_name: str) -> str:
    """Format messages using the model's chat template."""
    # Load the tokenizer to get the chat template
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(f"LinguaCustodia/{model_name}")

    # Use the built-in chat template if available
    if hasattr(tokenizer, 'apply_chat_template'):
        prompt = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        return prompt

    # Fallback for models without a template
    # ... existing logic
```

### Priority 3: Model-Specific Stop Tokens

```python
def get_stop_tokens_for_model(model_name: str) -> List[str]:
    """Get stop tokens based on the model."""
    model_stops = {
        "llama3.1-8b": ["<|end_of_text|>", "<|eot_id|>", "\nUser:", "\nAssistant:"],
        "qwen3-8b": ["<|im_end|>", "<|endoftext|>", "\nUser:", "\nAssistant:"],
        "gemma3-12b": ["<end_of_turn>", "<eos>", "\nUser:", "\nAssistant:"],
    }

    for key in model_stops:
        if key in model_name.lower():
            return model_stops[key]

    # Default stops
    return ["<|endoftext|>", "</s>", "\nUser:", "\nAssistant:", "\nSystem:"]
```

### Priority 4: Better Defaults

```python
# In the /v1/chat/completions endpoint
max_tokens = request.get("max_tokens", 512)                  # ✅ Increased from 150
temperature = request.get("temperature", 0.6)
repetition_penalty = request.get("repetition_penalty", 1.1)  # ✅ Increased from 1.05
```

---

## 🎯 **Expected Results After Fixes**

1. ✅ **True Token-by-Token Streaming** - The UI sees smooth word-by-word generation
2. ✅ **Clean Responses** - No EOS tokens in the output
3. ✅ **No Hallucinations** - The model stops at proper boundaries
4. ✅ **Longer Responses** - A 512-token default allows complete answers
5. ✅ **Less Repetition** - A stronger penalty prevents loops
6. ✅ **Model-Specific Handling** - Each model uses its own stop tokens

---

## 📝 **Implementation Order**

1. **Fix streaming delta calculation** (10 min) - Immediate UX improvement
2. **Add stop tokens to SamplingParams** (15 min) - Prevents hallucinations
3. **Implement get_stop_tokens_for_model()** (20 min) - Model-specific handling
4. **Use chat templates** (30 min) - Proper prompt formatting
5. **Update defaults** (5 min) - Better out-of-the-box experience
6. **Test with all 3 models** (30 min) - Verify the fixes work (see the streaming smoke test below)

**Total Time**: ~2 hours for a complete fix
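For step 6, a small streaming smoke test like the sketch below can confirm that the backend now emits true deltas. It is a hedged example, not part of the backend: the local `http://localhost:8000` base URL, the `qwen3-8b` model name, and the delta heuristic are assumptions to adapt to the actual deployment; only the `/v1/chat/completions` route mirrors the endpoint discussed above.

```python
# Streaming smoke test: consume the SSE stream and fail if chunks still look
# like accumulated text instead of deltas. Base URL and model name are assumed.
import json

import requests


def stream_once(model: str = "qwen3-8b", base_url: str = "http://localhost:8000") -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Explain what EBITDA measures."}],
        "max_tokens": 128,
        "stream": True,
    }
    collected = []
    with requests.post(f"{base_url}/v1/chat/completions", json=payload,
                       stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"].get("content", "")
            # With the old bug every chunk repeated the previous one as a prefix;
            # with true deltas a later chunk should not start with the prior chunk.
            if collected and len(collected[-1]) > 8 and delta.startswith(collected[-1]):
                raise AssertionError("chunk looks like accumulated text, not a delta")
            collected.append(delta)
    return "".join(collected)


if __name__ == "__main__":
    print(stream_once())
```

Running it once per model (llama3.1-8b, qwen3-8b, gemma3-12b) also exercises the model-specific stop tokens, since hallucinated `User:` / `Assistant:` turns would show up in the assembled output.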