# Backend Issues Analysis & Fixes

## 🔍 **Identified Problems**

### 1. **Streaming Issue - Sending Full Text Instead of Deltas**

**Location**: `app.py` lines 1037-1053

**Problem**:
```python
for output in inference_backend.engine.generate([prompt], sampling_params, use_tqdm=False):
    if output.outputs:
        text = output.outputs[0].text          # ❌ This is the FULL accumulated text
        chunk = {"delta": {"content": text}}   # ❌ Sending the full text as a "delta"
```

**Issue**: vLLM's `generate()` returns the full accumulated text on each iteration, not just the new tokens. We're re-sending the entire response every time, which is why the UI had to implement its own delta-extraction logic.

**Fix**: Track the previously sent text and send only the difference.

---

### 2. **Missing Stop Tokens Configuration**

**Location**: `app.py` lines 1029-1034

**Problem**:
```python
sampling_params = SamplingParams(
    temperature=temperature,
    max_tokens=max_tokens,
    top_p=0.9,
    repetition_penalty=1.05
)
# ❌ NO stop tokens configured!
```

**Issue**: Without proper stop tokens the model doesn't know when to stop and keeps generating, leading to:
- Conversation hallucinations (`User:`, `Assistant:` appearing in the output)
- Stray EOS tokens in the output (e.g. `<|endoftext|>`)
- Responses that don't end cleanly

**Fix**: Add proper stop tokens based on the model type.

---

### 3. **Prompt Format Causing Hallucinations**

**Location**: `app.py` lines 1091-1103

**Problem**:
```python
prompt = ""
for message in messages:
    role = message.get("role", "")
    content = message.get("content", "")
    if role == "system":
        prompt += f"System: {content}\n"
    elif role == "user":
        prompt += f"User: {content}\n"
    elif role == "assistant":
        prompt += f"Assistant: {content}\n"
prompt += "Assistant:"
```

**Issue**: This naive format encourages the model to continue the pattern, causing it to generate:
```
Assistant: [response]
User: [hallucinated]
Assistant: [more hallucination]
```

**Fix**: Use the proper chat template from the model's tokenizer.

---

### 4. **Default max_tokens Too Low**

**Location**: `app.py` line 1088

**Problem**:
```python
max_tokens = request.get("max_tokens", 150)  # ❌ Too restrictive
```

**Issue**: 150 tokens is very limiting for financial explanations; responses get cut off mid-sentence.

**Fix**: Increase the default to 512-1024 tokens.

---

### 5. **No Model-Specific EOS Tokens**

**Location**: Multiple places

**Problem**: Each LinguaCustodia model has different EOS token IDs:
- **llama3.1-8b**: `[128001, 128008, 128009]`
- **qwen3-8b**: `[151645, 151643]`
- **gemma3-12b**: `[1, 106]`

But none of them are passed to vLLM's `SamplingParams`!

**Fix**: Load the EOS tokens from the model config and pass them to vLLM (see the sketch at the end of this section).

---

### 6. **Repetition Penalty Too Low**

**Location**: `app.py` line 1033

**Problem**:
```python
repetition_penalty=1.05  # Too weak to prevent loops
```

**Issue**: Financial models can get stuck in repetitive patterns, and 1.05 is barely noticeable.

**Fix**: Increase to 1.1-1.15 for better repetition prevention.
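Before moving on to the fixes, here is a minimal sketch of how the EOS-token gap from problem 5 could be closed: read `eos_token_id` from the model's generation config and hand the IDs to vLLM via `SamplingParams(stop_token_ids=...)`. This is illustrative only; the `LinguaCustodia/{model_name}` repo path, the `FALLBACK_EOS_IDS` table (copied from the values listed above), and the helper name `get_eos_token_ids` are assumptions, not existing backend code.

```python
# Sketch: resolve model-specific EOS token IDs and pass them to vLLM.
from typing import List

from transformers import GenerationConfig
from vllm import SamplingParams

# Assumed fallback table, mirroring the EOS IDs listed in problem 5 above.
FALLBACK_EOS_IDS = {
    "llama3.1-8b": [128001, 128008, 128009],
    "qwen3-8b": [151645, 151643],
    "gemma3-12b": [1, 106],
}


def get_eos_token_ids(model_name: str) -> List[int]:
    """Read eos_token_id from the model's generation config, with a static fallback."""
    try:
        gen_cfg = GenerationConfig.from_pretrained(f"LinguaCustodia/{model_name}")
        eos = gen_cfg.eos_token_id
        if eos is not None:
            return eos if isinstance(eos, list) else [eos]
    except Exception:
        pass  # offline or missing generation_config.json -> use the static table
    for key, ids in FALLBACK_EOS_IDS.items():
        if key in model_name.lower():
            return ids
    return []


# Usage: generation halts on any of the model-specific EOS token IDs.
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=512,
    top_p=0.9,
    repetition_penalty=1.1,
    stop_token_ids=get_eos_token_ids("qwen3-8b"),
)
```

Token-ID stops complement the string stops added in Priority 3 below: the strings catch plain-text boundaries like `\nUser:`, while `stop_token_ids` halts generation on the tokenizer's actual EOS tokens.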
---

## ✅ **Recommended Fixes**

### Priority 1: Fix Streaming (Critical for UX)

```python
async def stream_chat_completion(prompt: str, model: str, temperature: float, max_tokens: int, request_id: str):
    try:
        from vllm import SamplingParams

        # Get model-specific stop tokens
        stop_tokens = get_stop_tokens_for_model(model)

        sampling_params = SamplingParams(
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=0.9,
            repetition_penalty=1.1,
            stop=stop_tokens  # ✅ Add stop tokens
        )

        previous_text = ""  # ✅ Track what we've already sent

        for output in inference_backend.engine.generate([prompt], sampling_params, use_tqdm=False):
            if output.outputs:
                current_text = output.outputs[0].text

                # ✅ Send only the NEW part
                new_text = current_text[len(previous_text):]

                if new_text:
                    chunk = {
                        "id": request_id,
                        "object": "chat.completion.chunk",
                        "created": int(time.time()),
                        "model": model,
                        "choices": [{
                            "index": 0,
                            "delta": {"content": new_text},  # ✅ True delta
                            "finish_reason": None
                        }]
                    }
                    yield f"data: {json.dumps(chunk)}\n\n"
                    previous_text = current_text
    except Exception as exc:
        # Surface backend errors to the client instead of silently dropping the stream
        error_chunk = {"error": {"message": str(exc)}}
        yield f"data: {json.dumps(error_chunk)}\n\n"
```

### Priority 2: Use Proper Chat Templates

```python
def format_chat_prompt(messages: List[Dict], model_name: str) -> str:
    """Format messages using the model's chat template."""
    # Load the tokenizer to get the chat template
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(f"LinguaCustodia/{model_name}")

    # Use the built-in chat template if available
    if hasattr(tokenizer, 'apply_chat_template'):
        prompt = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        return prompt

    # Fallback for models without a template
    # ... existing logic
```

### Priority 3: Model-Specific Stop Tokens

```python
def get_stop_tokens_for_model(model_name: str) -> List[str]:
    """Get stop tokens based on the model."""
    model_stops = {
        "llama3.1-8b": ["<|end_of_text|>", "<|eot_id|>", "\nUser:", "\nAssistant:"],
        "qwen3-8b": ["<|im_end|>", "<|endoftext|>", "\nUser:", "\nAssistant:"],
        "gemma3-12b": ["<end_of_turn>", "<eos>", "\nUser:", "\nAssistant:"],
    }

    for key in model_stops:
        if key in model_name.lower():
            return model_stops[key]

    # Default stops
    return ["<|endoftext|>", "</s>", "\nUser:", "\nAssistant:", "\nSystem:"]
```

### Priority 4: Better Defaults

```python
# In the /v1/chat/completions endpoint
max_tokens = request.get("max_tokens", 512)                  # ✅ Increased from 150
temperature = request.get("temperature", 0.6)
repetition_penalty = request.get("repetition_penalty", 1.1)  # ✅ Increased from 1.05
```

---

## 🎯 **Expected Results After Fixes**

1. ✅ **True Token-by-Token Streaming** - The UI sees smooth word-by-word generation
2. ✅ **Clean Responses** - No EOS tokens in the output
3. ✅ **No Hallucinations** - The model stops at proper boundaries
4. ✅ **Longer Responses** - A 512-token default allows complete answers
5. ✅ **Less Repetition** - A stronger penalty prevents loops
6. ✅ **Model-Specific Handling** - Each model uses its own stop tokens

---

## 📝 **Implementation Order**

1. **Fix streaming delta calculation** (10 min) - Immediate UX improvement
2. **Add stop tokens to SamplingParams** (15 min) - Prevents hallucinations
3. **Implement get_stop_tokens_for_model()** (20 min) - Model-specific handling
4. **Use chat templates** (30 min) - Proper prompt formatting
5. **Update defaults** (5 min) - Better out-of-the-box experience
6. **Test with all 3 models** (30 min) - Verify the fixes work (see the streaming smoke test below)

**Total Time**: ~2 hours for a complete fix
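For step 6, a small streaming smoke test like the sketch below can confirm that the backend now emits true deltas. It is a hedged example, not part of the backend: the local `http://localhost:8000` base URL, the `qwen3-8b` model name, and the delta heuristic are assumptions to adapt to the actual deployment; only the `/v1/chat/completions` route mirrors the endpoint discussed above.

```python
# Streaming smoke test: consume the SSE stream and fail if chunks still look
# like accumulated text instead of deltas. Base URL and model name are assumed.
import json

import requests


def stream_once(model: str = "qwen3-8b", base_url: str = "http://localhost:8000") -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Explain what EBITDA measures."}],
        "max_tokens": 128,
        "stream": True,
    }
    collected = []
    with requests.post(f"{base_url}/v1/chat/completions", json=payload,
                       stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue
            data = line[len("data: "):]
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"].get("content", "")
            # With the old bug every chunk repeated the previous one as a prefix;
            # with true deltas a later chunk should not start with the prior chunk.
            if collected and len(collected[-1]) > 8 and delta.startswith(collected[-1]):
                raise AssertionError("chunk looks like accumulated text, not a delta")
            collected.append(delta)
    return "".join(collected)


if __name__ == "__main__":
    print(stream_once())
```

Running it once per model (llama3.1-8b, qwen3-8b, gemma3-12b) also exercises the model-specific stop tokens, since hallucinated `User:` / `Assistant:` turns would show up in the assembled output.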