# ๐ User Guide - ZeroGPU LLM Inference
## Quick Start (5 Minutes)
### 1. Choose Your Model
The model dropdown shows 30+ options organized by size:
- **Compact (<2B)**: Fast, lightweight - great for quick responses
- **Mid-size (2-8B)**: Best balance of speed and quality
- **Large (14B+)**: Highest quality, slower but more capable
**Recommendation for beginners**: Start with `Qwen3-4B-Instruct-2507`
### 2. Try an Example Prompt
Click on any example below the chat box to get started:
- "Explain quantum computing in simple terms"
- "Write a Python function..."
- "What are the latest developments..." (requires web search)
### 3. Start Chatting!
Type your message and press Enter or click "๐ค Send"
## Core Features
### ๐ฌ Chat Interface
The main chat area shows:
- Your messages on one side
- AI responses with a ๐ค avatar
- Copy button on each message
- Smooth streaming as tokens generate
**Tips:**
- Press Enter to send (Shift+Enter for new line)
- Click Copy button to save responses
- Scroll up to review history
- Use Clear Chat to start fresh
### ๐ค Model Selection
**When to use each size:**
| Model Size | Best For | Speed | Quality |
|------------|----------|-------|---------|
| <2B | Quick questions, testing | โกโกโก | โญโญ |
| 2-8B | General chat, coding help | โกโก | โญโญโญ |
| 14B+ | Complex reasoning, long-form | โก | โญโญโญโญ |
**Specialized Models:**
- **Phi-4-mini-Reasoning**: Math, logic problems
- **Qwen2.5-Coder**: Programming tasks
- **DeepSeek-R1-Distill**: Step-by-step reasoning
- **Apriel-1.5-15b-Thinker**: Multimodal understanding
### ๐ Web Search
Enable this when you need:
- Current events and news
- Recent information (after model training cutoff)
- Facts that change frequently
- Real-time data
**How it works:**
1. Toggle "๐ Enable Web Search"
2. Web search settings accordion appears
3. System prompt updates automatically
4. Search runs in background (won't block chat)
5. Results injected into context
**Settings explained:**
- **Max Results**: How many search results to fetch (4 is good default)
- **Max Chars/Result**: Limit length per result (50 prevents overwhelming context)
- **Search Timeout**: Maximum wait time (5s recommended)
### ๐ System Prompt
This defines the AI's personality and behavior.
**Default prompts:**
- Without search: Helpful, creative assistant
- With search: Includes search results and current date
**Customization ideas:**
```
You are a professional code reviewer...
You are a creative writing coach...
You are a patient tutor explaining concepts simply...
You are a technical documentation writer...
```
## Advanced Features
### ๐๏ธ Advanced Generation Parameters
Click the accordion to reveal these controls:
#### Max Tokens (64-16384)
- **What it does**: Sets maximum response length
- **Lower (256-512)**: Quick, concise answers
- **Medium (1024)**: Balanced (default)
- **Higher (2048+)**: Long-form content, detailed explanations
#### Temperature (0.1-2.0)
- **What it does**: Controls randomness/creativity
- **Low (0.1-0.3)**: Focused, deterministic (good for facts, code)
- **Medium (0.7)**: Balanced creativity (default)
- **High (1.2-2.0)**: Very creative, unpredictable (stories, brainstorming)
#### Top-K (1-100)
- **What it does**: Limits token choices to top K most likely
- **Lower (10-20)**: More focused
- **Medium (40)**: Balanced (default)
- **Higher (80-100)**: More varied vocabulary
#### Top-P (0.1-1.0)
- **What it does**: Nucleus sampling threshold
- **Lower (0.5-0.7)**: Conservative choices
- **Medium (0.9)**: Balanced (default)
- **Higher (0.95-1.0)**: Full vocabulary range
#### Repetition Penalty (1.0-2.0)
- **What it does**: Reduces repeated words/phrases
- **Low (1.0-1.1)**: Allows some repetition
- **Medium (1.2)**: Balanced (default)
- **High (1.5+)**: Strongly avoids repetition (may hurt coherence)
### Preset Configurations
**For Creative Writing:**
```
Temperature: 1.2
Top-P: 0.95
Top-K: 80
Max Tokens: 2048
```
**For Code Generation:**
```
Temperature: 0.3
Top-P: 0.9
Top-K: 40
Max Tokens: 1024
Repetition Penalty: 1.1
```
**For Factual Q&A:**
```
Temperature: 0.5
Top-P: 0.85
Top-K: 30
Max Tokens: 512
Enable Web Search: Yes
```
**For Reasoning Tasks:**
```
Model: Phi-4-mini-Reasoning or DeepSeek-R1
Temperature: 0.7
Max Tokens: 2048
```
## Tips & Tricks
### ๐ฏ Getting Better Results
1. **Be Specific**: "Write a Python function to sort a list" โ "Write a Python function that sorts a list of dictionaries by a specific key"
2. **Provide Context**: "Explain recursion" โ "Explain recursion to someone learning programming for the first time, with a simple example"
3. **Use System Prompts**: Define role/expertise in system prompt instead of every message
4. **Iterate**: Use follow-up questions to refine responses
5. **Experiment with Models**: Try different models for the same task
### โก Performance Tips
1. **Start Small**: Test with smaller models first
2. **Adjust Max Tokens**: Don't request more than you need
3. **Use Cancel**: Stop bad generations early
4. **Clear Cache**: Clear chat if experiencing slowdowns
5. **One Task at a Time**: Don't send multiple requests simultaneously
### ๐ When to Use Web Search
**โ
Good use cases:**
- "What happened in the latest SpaceX launch?"
- "Current cryptocurrency prices"
- "Recent AI research papers"
- "Today's weather in Paris"
**โ Don't need search for:**
- General knowledge questions
- Code writing/debugging
- Math problems
- Creative writing
- Theoretical explanations
### ๐ญ Understanding Thinking Mode
Some models output `...` blocks:
```
Let me break this down step by step...
First, I need to consider...
Here's the answer: ...
```
**In the UI:**
- Thinking shows as "๐ญ Thought"
- Answer shows separately
- Helps you see the reasoning process
**Best for:**
- Complex math problems
- Multi-step reasoning
- Debugging logic
- Learning how AI thinks
## Troubleshooting
### Generation is Slow
- Try a smaller model
- Reduce Max Tokens
- Disable web search if not needed
- Clear chat history
### Responses are Repetitive
- Increase Repetition Penalty
- Reduce Temperature slightly
- Try different model
### Responses are Random/Nonsensical
- Decrease Temperature
- Reduce Top-P
- Reduce Top-K
- Try more stable model
### Web Search Not Working
- Check timeout isn't too short
- Verify internet connection
- Try increasing Max Results
- Check search query in debug panel
### Cancel Button Doesn't Work
- Wait a moment (might be processing)
- Refresh page if persists
- Check browser console for errors
## Keyboard Shortcuts
- **Enter**: Send message
- **Shift+Enter**: New line in input
- **Ctrl+C**: Copy (when text selected)
- **Ctrl+A**: Select all in input
## Best Practices
### For Beginners
1. Start with example prompts
2. Use default settings initially
3. Try 2-4 different models
4. Gradually explore advanced settings
5. Read responses fully before replying
### For Power Users
1. Create custom system prompts
2. Fine-tune parameters per task
3. Use debug panel for prompt engineering
4. Experiment with model combinations
5. Utilize web search strategically
### For Developers
1. Study the debug output
2. Test code generation thoroughly
3. Use lower temperature for determinism
4. Compare multiple models
5. Save working configurations
## Privacy & Safety
- **No data collection**: Conversations not stored permanently
- **Model limitations**: May produce incorrect information
- **Verify important info**: Don't rely solely on AI for critical decisions
- **Web search**: Uses DuckDuckGo (privacy-focused)
- **Open source**: Code is transparent and auditable
## Support & Feedback
Found a bug? Have a suggestion?
- Check GitHub issues
- Submit feature requests
- Contribute improvements
- Share your use cases
---
**Happy chatting! ๐**