Haiss123 committed 6b98b09 · verified · 1 parent: 82189df

Upload 20 files
ADVANCED_RAG_GUIDE.md ADDED
# Advanced RAG Chatbot - User Guide

## What's New?

### 1. Multiple Images & Texts Support in `/index` API

The `/index` endpoint now supports indexing multiple texts and images in a single request (max 10 of each).

**Before:**
```python
# Old: only 1 text and 1 image per request
data = {
    'id': 'doc1',
    'text': 'Single text',
}
files = {'image': open('image.jpg', 'rb')}
```

**After:**
```python
import requests

# New: multiple texts and images (max 10 each)
data = {
    'id': 'doc1',
    'texts': ['Text 1', 'Text 2', 'Text 3'],  # up to 10
}
# Repeated 'images' entries are sent as a multipart list
files = [
    ('images', open('image1.jpg', 'rb')),
    ('images', open('image2.jpg', 'rb')),
    ('images', open('image3.jpg', 'rb')),  # up to 10
]
response = requests.post('http://localhost:8000/index', data=data, files=files)
```

**Example with cURL:**
```bash
curl -X POST "http://localhost:8000/index" \
  -F "id=event123" \
  -F "texts=Sự kiện âm nhạc tại Hà Nội" \
  -F "texts=Diễn ra vào ngày 20/10/2025" \
  -F "texts=Địa điểm: Trung tâm Hội nghị Quốc gia"
```

### 2. Advanced RAG Pipeline in `/chat` API

The chat endpoint now uses modern RAG techniques for better response quality:

#### Key Improvements:

1. **Query Expansion**: Automatically expands your question with variations
2. **Multi-Query Retrieval**: Searches with multiple query variants
3. **Reranking**: Re-scores results for better relevance
4. **Contextual Compression**: Keeps only the most relevant parts
5. **Better Prompt Engineering**: Prompts optimized for the LLM

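End to end, the stages above can be sketched in a few lines. This is only an illustration of the control flow with hypothetical helpers (`expand_query`, `vector_search`), not the service's actual implementation:

```python
# Minimal sketch of the advanced RAG stages; every helper here is
# hypothetical -- the real pipeline lives behind the /chat endpoint.

def expand_query(query):
    # Query expansion: produce simple variants of the question
    return [query, f"{query} (more detail)", f"Explain: {query}"]

def vector_search(query, top_k):
    # Stand-in for a vector DB call; returns (doc_id, score) pairs
    corpus = {"doc1": 0.9, "doc2": 0.6, "doc3": 0.4}
    return sorted(corpus.items(), key=lambda kv: -kv[1])[:top_k]

def advanced_rag(query, top_k=5, score_threshold=0.5):
    # 1. Query expansion -> multiple variants
    variants = expand_query(query)
    # 2. Multi-query retrieval: search once per variant, merge by best score
    merged = {}
    for v in variants:
        for doc_id, score in vector_search(v, top_k):
            merged[doc_id] = max(merged.get(doc_id, 0.0), score)
    # 3. Rerank (here: re-sort) and filter by score_threshold
    ranked = sorted(merged.items(), key=lambda kv: -kv[1])
    return [(d, s) for d, s in ranked if s >= score_threshold]

print(advanced_rag("Dao có nguy hiểm không?"))
# doc3 (score 0.4) falls below the 0.5 threshold
```

Lowering `score_threshold` admits the lower-scored document, which mirrors how the real `score_threshold` parameter behaves below.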
#### How to Use:

**Basic Usage (Auto-enabled):**
```python
import requests

response = requests.post('http://localhost:8000/chat', json={
    'message': 'Dao có nguy hiểm không?',
    'use_rag': True,
    'use_advanced_rag': True,  # default: True
    'hf_token': 'hf_xxxxx'
})

result = response.json()
print("Response:", result['response'])
print("RAG Stats:", result['rag_stats'])  # pipeline statistics
```

**Advanced Configuration:**
```python
response = requests.post('http://localhost:8000/chat', json={
    'message': 'Làm sao để tạo event mới?',
    'use_rag': True,
    'use_advanced_rag': True,

    # RAG pipeline options
    'use_query_expansion': True,  # expand query with variations
    'use_reranking': True,        # rerank results
    'use_compression': True,      # compress context
    'score_threshold': 0.5,       # min relevance score (0-1)
    'top_k': 5,                   # number of documents to retrieve

    # LLM options
    'max_tokens': 512,
    'temperature': 0.7,
    'hf_token': 'hf_xxxxx'
})
```

**Disable Advanced RAG (Use Basic):**
```python
response = requests.post('http://localhost:8000/chat', json={
    'message': 'Your question',
    'use_rag': True,
    'use_advanced_rag': False,  # fall back to basic RAG
})
```

## API Changes Summary

### `/index` Endpoint

**Old Parameters:**
- `id`: str (required)
- `text`: str (required)
- `image`: UploadFile (optional)

**New Parameters:**
- `id`: str (required)
- `texts`: List[str] (optional, max 10)
- `images`: List[UploadFile] (optional, max 10)

**Response:**
```json
{
  "success": true,
  "id": "doc123",
  "message": "Đã index thành công document doc123 với 3 texts và 2 images"
}
```

### `/chat` Endpoint

**New Parameters:**
- `use_advanced_rag`: bool (default: True) - Enable the advanced RAG pipeline
- `use_query_expansion`: bool (default: True) - Expand the query
- `use_reranking`: bool (default: True) - Rerank results
- `use_compression`: bool (default: True) - Compress context
- `score_threshold`: float (default: 0.5) - Minimum relevance score

**Response (New):**
```json
{
  "response": "AI generated answer...",
  "context_used": [...],
  "timestamp": "2025-10-29T...",
  "rag_stats": {
    "original_query": "Your question",
    "expanded_queries": ["Query variant 1", "Query variant 2"],
    "initial_results": 10,
    "after_rerank": 5,
    "after_compression": 5
  }
}
```

## Complete Examples

### Example 1: Index Multiple Social Media Posts

```python
import requests

# Index a social media event with multiple posts and images
data = {
    'id': 'event_festival_2025',
    'texts': [
        'Festival âm nhạc quốc tế Hà Nội 2025',
        'Ngày 15-17 tháng 11 năm 2025',
        'Địa điểm: Công viên Thống Nhất',
        'Line-up: Sơn Tùng MTP, Đen Vâu, Hoàng Thùy Linh',
        'Giá vé từ 500.000đ - 2.000.000đ'
    ]
}

files = [
    ('images', open('poster_festival.jpg', 'rb')),
    ('images', open('lineup.jpg', 'rb')),
    ('images', open('venue_map.jpg', 'rb'))
]

response = requests.post('http://localhost:8000/index', data=data, files=files)
print(response.json())
```

### Example 2: Advanced RAG Chat

```python
import requests

# Chat with the advanced RAG pipeline
chat_response = requests.post('http://localhost:8000/chat', json={
    'message': 'Festival âm nhạc Hà Nội diễn ra khi nào và ở đâu?',
    'use_rag': True,
    'use_advanced_rag': True,
    'top_k': 3,
    'score_threshold': 0.6,
    'hf_token': 'your_hf_token_here'
})

result = chat_response.json()
print("Answer:", result['response'])
print("\nRetrieved Context:")
for ctx in result['context_used']:
    print(f"- [{ctx['id']}] Confidence: {ctx['confidence']:.2%}")

print("\nRAG Pipeline Stats:")
print(f"- Original query: {result['rag_stats']['original_query']}")
print(f"- Query variants: {result['rag_stats']['expanded_queries']}")
print(f"- Documents retrieved: {result['rag_stats']['initial_results']}")
print(f"- After reranking: {result['rag_stats']['after_rerank']}")
```

## Performance Comparison

| Feature | Basic RAG | Advanced RAG |
|---------|-----------|--------------|
| Query Understanding | Single query | Multiple query variants |
| Retrieval Method | Direct vector search | Multi-query + hybrid |
| Result Ranking | Score from DB | Reranked with semantic similarity |
| Context Quality | Full text | Compressed, relevant parts only |
| Response Accuracy | Good | Better |
| Response Time | Faster | Slightly slower, higher quality |

## When to Use What?

**Use Basic RAG when:**
- You need fast response times
- Queries are straightforward
- The context is already well-structured

**Use Advanced RAG when:**
- You need higher accuracy
- Queries are complex or ambiguous
- Context documents are long
- You want better relevance

## Troubleshooting

### Error: "Tối đa 10 texts" (max 10 texts)
You're sending more than 10 texts. Reduce the request to at most 10.

### Error: "Tối đa 10 images" (max 10 images)
You're sending more than 10 images. Reduce the request to at most 10.

### RAG stats show 0 results
Your `score_threshold` may be too high. Try lowering it (e.g., 0.3-0.5).

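If the stats keep coming back empty, one pragmatic pattern is to retry with a progressively lower threshold. A small hedged sketch (the request parameters follow the `/chat` API described above; the fallback strategy itself is not part of the service):

```python
# Retry /chat with progressively lower score_threshold values until
# the RAG stats report at least one retrieved document.
import requests

def chat_with_fallback(message, thresholds=(0.6, 0.45, 0.3),
                       url='http://localhost:8000/chat'):
    for t in thresholds:
        r = requests.post(url, json={
            'message': message,
            'use_rag': True,
            'use_advanced_rag': True,
            'score_threshold': t,
        }).json()
        if r.get('rag_stats', {}).get('initial_results', 0) > 0:
            return r  # got context at this threshold
    return r  # last attempt, possibly without any context
```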
## Next Steps

To further improve RAG, consider:

1. **Add BM25 Hybrid Search**: Combine dense + sparse retrieval
2. **Use Cross-Encoder for Reranking**: Better than embedding similarity
3. **Implement Query Decomposition**: Break complex queries into sub-queries
4. **Add Citation/Source Tracking**: Show which document each fact comes from
5. **Integrate RAG-Anything**: For advanced multimodal document processing

For RAG-Anything integration (more complex), see: https://github.com/HKUDS/RAG-Anything
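On item 2: reranking is just "re-score each (query, chunk) pair with a stronger model and re-sort". The sketch below uses token overlap as a stand-in scorer so it stays self-contained; in practice you would swap `score_pair` for a cross-encoder model (e.g., from the `sentence-transformers` library):

```python
# Sketch of the rerank step. `score_pair` is a placeholder: real rerankers
# score the (query, chunk) pair jointly with a cross-encoder model.

def score_pair(query, chunk):
    # Token-overlap score in [0, 1] -- only a stand-in for a model score
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query, chunks, top_k=5):
    # Sort candidates by pair score, keep the best top_k
    ranked = sorted(chunks, key=lambda c: -score_pair(query, c))
    return ranked[:top_k]

docs = ["music festival in Hanoi", "ticket prices", "festival dates in Hanoi"]
print(rerank("festival in Hanoi", docs, top_k=2))
```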
MULTIMODAL_PDF_GUIDE.md ADDED
# Multimodal PDF Guide - PDFs with Text + Images

## Overview

The system now supports **multimodal PDFs** - PDFs that contain:
- ✅ Instructional text
- ✅ Image URLs (links to images)
- ✅ Markdown images: `![alt](url)`
- ✅ HTML images: `<img src="url">`

**Perfect for**: user guides with screenshots, tutorials with diagrams, documentation with visual aids.

---

## Why Multimodal?

### The Problem with Plain PDFs

Instructional PDFs typically look like this:
```
Step 1: Open the homepage
[See image: https://example.com/homepage.png]

Step 2: Click "Create"
![Create button](https://example.com/create-button.png)

Step 3: Fill in the details
<img src="https://example.com/form.png" alt="Form" />
```

The **old PDF parser** extracted only the text, so **all image URLs were lost** and the chatbot had no idea which images were relevant.

The **new multimodal PDF parser**:
- ✓ Extracts the text
- ✓ Detects every image URL
- ✓ Links images to their corresponding text chunks
- ✓ Stores the URLs in chunk metadata
- ✓ Returns images alongside text at chat time

---

## Comparison: Plain PDF vs Multimodal PDF

| Feature | Plain PDF (`/upload-pdf`) | Multimodal PDF (`/upload-pdf-multimodal`) |
|---------|---------------------------|-------------------------------------------|
| Extract text | ✓ | ✓ |
| Detect image URLs | ✗ | ✓ |
| Link images to chunks | ✗ | ✓ |
| Return images in chat | ✗ | ✓ |
| URL formats supported | ✗ | http://, https://, Markdown, HTML |
| Use case | Simple text documents | User guides, tutorials, docs with images |

---

## How to Use

### 1. Upload a Multimodal PDF

**Endpoint:** `POST /upload-pdf-multimodal`

**cURL:**
```bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
  -F "file=@user_guide_with_images.pdf" \
  -F "title=Hướng dẫn sử dụng hệ thống" \
  -F "description=User guide with screenshots" \
  -F "category=user_guide"
```

**Python:**
```python
import requests

with open('user_guide_with_images.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/upload-pdf-multimodal',
        files={'file': f},
        data={
            'title': 'User Guide with Screenshots',
            'category': 'user_guide'
        }
    )

result = response.json()
print(f"Indexed: {result['chunks_indexed']} chunks")
print(f"Message: {result['message']}")
```

**Response:**
```json
{
  "success": true,
  "document_id": "pdf_multimodal_20251029_150000",
  "filename": "user_guide_with_images.pdf",
  "chunks_indexed": 25,
  "message": "PDF 'user_guide_with_images.pdf' indexed successfully with 25 chunks and 15 images"
}
```

### 2. Chat with Multimodal Context

```python
import requests

response = requests.post('http://localhost:8000/chat', json={
    'message': 'Làm sao để tạo event mới?',
    'use_rag': True,
    'use_advanced_rag': True,
    'top_k': 3,
    'hf_token': 'your_token'
})

result = response.json()

# Response text
print("Answer:", result['response'])

# Retrieved context, including any linked images
for ctx in result['context_used']:
    print(f"\n--- Source: Page {ctx['metadata']['page']} ---")
    print(f"Text: {ctx['metadata']['text'][:200]}...")

    # Check whether this chunk has images
    if ctx['metadata'].get('has_images'):
        print(f"Images ({ctx['metadata']['num_images']}):")
        for img_url in ctx['metadata'].get('image_urls', []):
            print(f"  - {img_url}")
```

**Example Output:**
```
Answer: Để tạo event mới, bạn thực hiện các bước sau:
1. Mở trang chủ và click vào nút "Tạo Event" (xem hình minh họa)
2. Điền thông tin event...

--- Source: Page 5 ---
Text: Bước 1: Mở trang chủ và click vào nút "Tạo Event"...
Images (2):
  - https://example.com/homepage.png
  - https://example.com/create-button.png
```

---

## Preparing Your PDF

### Supported Formats

The multimodal parser detects the following formats:

1. **Standard URLs:**
   ```
   See image: https://example.com/image.png
   Screenshot: http://cdn.example.com/screenshot.jpg
   ```

2. **Markdown images:**
   ```markdown
   ![Homepage](https://example.com/homepage.png)
   ![Button](https://example.com/button.png)
   ```

3. **HTML images:**
   ```html
   <img src="https://example.com/form.png" alt="Form" />
   <img src="http://example.com/result.jpg">
   ```

4. **Bare URLs with image extensions:**
   ```
   https://example.com/pic.jpg
   https://example.com/chart.png
   https://example.com/diagram.svg
   ```

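Detection of these formats boils down to a few regular expressions. A minimal, self-contained approximation (not the actual patterns used by `MultimodalPDFParser`) could look like:

```python
# Rough approximation of multimodal image-URL detection; the real
# MultimodalPDFParser patterns may differ.
import re

IMG_EXT = r"(?:png|jpe?g|gif|svg|webp)"

def extract_image_urls(text):
    urls = []
    # Markdown: ![alt](url)
    urls += re.findall(r"!\[[^\]]*\]\((https?://[^)\s]+)\)", text)
    # HTML: <img src="url">
    urls += re.findall(r"<img[^>]+src=[\"'](https?://[^\"']+)[\"']", text)
    # Bare URLs ending in an image extension
    urls += re.findall(rf"(https?://\S+\.{IMG_EXT})", text)
    # De-duplicate while preserving order
    return list(dict.fromkeys(urls))

sample = """
See image: https://example.com/image.png
![Button](https://example.com/button.jpg)
<img src="https://example.com/form.png" alt="Form" />
"""
print(extract_image_urls(sample))
```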
### Best Practices

#### ✓ Good

**Example PDF content:**
```
# Creating an Event

## Step 1: Open the Homepage

Go to the system homepage.

![Homepage Screenshot](https://docs.example.com/images/homepage.png)

You will see the main screen with a menu on the left.

## Step 2: Click "Create Event"

Find and click the "Create Event" button in the top-right corner.

![Create Event Button](https://docs.example.com/images/create-button.png)

## Step 3: Fill in the Details

Enter the following into the form:
- Event name
- Date and time
- Location

Form template: https://docs.example.com/images/event-form.png
```

**Why this works:**
- Clear structure (headings)
- Each step pairs text with an image
- URLs are explicit and easy to detect
- Context sits right next to each image

#### ✗ Avoid

```
See the figures below [1] [2] [3]

[Images collected at the end of the document]

...

[1] homepage.png
[2] button.png
[3] form.png
```

**Why this fails:**
- Image references have no URLs
- Images are separated from their context
- Only filenames, no full URLs

---

## A Worked Example

### Building a Multimodal Guide PDF

**File: `chatbot_guide_with_images.md`**

```markdown
# ChatbotRAG User Guide

## 1. Upload a PDF

### Step 1: Prepare the PDF file

Make sure your PDF file is ready.

![PDF File Icon](https://via.placeholder.com/150?text=PDF+File)

### Step 2: Use cURL or Python

**With cURL:**

\`\`\`bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \\
  -F "file=@your_file.pdf"
\`\`\`

![cURL Command Example](https://via.placeholder.com/400x100?text=cURL+Command)

**With Python:**

\`\`\`python
import requests
# Upload code here
\`\`\`

### Step 3: Verify the Upload

Check the upload result:

https://via.placeholder.com/500x300?text=Upload+Success+Message

## 2. Chat with the Chatbot

After uploading, you can ask the chatbot:

![Chat Interface](https://via.placeholder.com/600x400?text=Chat+Interface)

**Sample questions:**
- "Làm sao để upload PDF?"
- "Các bước tạo event là gì?"

![Chat Example](https://via.placeholder.com/600x300?text=Chat+Example)

## 3. View the Results

The chatbot answers based on the PDF content:

https://via.placeholder.com/600x350?text=Chat+Response+with+Images
```

**Convert to PDF:**
```bash
pandoc chatbot_guide_with_images.md -o chatbot_guide_with_images.pdf
```

**Upload:**
```bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
  -F "file=@chatbot_guide_with_images.pdf" \
  -F "title=ChatbotRAG Guide" \
  -F "category=user_guide"
```

---

## Advanced: Custom Image Handling

### Option 1: Local Images

If your images are local files, host them first:

```bash
# Simple HTTP server
cd /path/to/images
python -m http.server 8080

# Images are then available at:
# http://localhost:8080/image1.png
# http://localhost:8080/image2.png
```

Then reference them in the PDF:
```
![Image](http://localhost:8080/image1.png)
```

### Option 2: Cloud Storage

Upload the images to cloud storage (AWS S3, Cloudinary, Imgur, etc.):

```python
# Example: upload an image to Imgur and get back a public URL
import requests

def upload_to_imgur(image_path):
    client_id = 'YOUR_CLIENT_ID'
    headers = {'Authorization': f'Client-ID {client_id}'}

    with open(image_path, 'rb') as img:
        response = requests.post(
            'https://api.imgur.com/3/image',
            headers=headers,
            files={'image': img}
        )

    return response.json()['data']['link']

# Upload images
url1 = upload_to_imgur('screenshot1.png')
url2 = upload_to_imgur('screenshot2.png')

# Use the URLs in your PDF
print(f"![Screenshot 1]({url1})")
```

### Option 3: Embed Images as Base64

If the PDF has embedded images, render the pages out and encode them:

```python
import pypdfium2 as pdfium
import io
import base64

def extract_images_from_pdf(pdf_path):
    """Render each PDF page to a PNG and return it as a base64 data URL.

    Note: this rasterizes whole pages rather than pulling out individual
    embedded image objects.
    """
    pdf = pdfium.PdfDocument(pdf_path)
    images = []

    for page_num in range(len(pdf)):
        page = pdf[page_num]
        # Render the page as an image
        bitmap = page.render(scale=2.0)
        pil_image = bitmap.to_pil()

        # Encode as a base64 PNG
        buffered = io.BytesIO()
        pil_image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()

        images.append({
            'page': page_num + 1,
            'base64': img_str,
            'url': f'data:image/png;base64,{img_str}'
        })

    return images
```

---

## Troubleshooting

### Images are not detected

**Likely causes:**
- URLs are malformed (missing http://)
- URLs are broken across line wraps
- Invalid Markdown syntax

**How to check:**
```python
# Test URL detection directly
from multimodal_pdf_parser import MultimodalPDFParser

parser = MultimodalPDFParser()
test_text = """
See image: https://example.com/image.png
![Alt](https://example.com/pic.jpg)
"""

urls = parser.extract_image_urls(test_text)
print("Found URLs:", urls)
```

### The chatbot does not return images

**Check:**
1. Verify the PDF was indexed with the multimodal parser:
   ```bash
   curl http://localhost:8000/documents/pdf
   # Look for "type": "multimodal_pdf"
   ```

2. Check that the metadata contains `image_urls`:
   ```python
   response = requests.post('http://localhost:8000/chat', ...)
   for ctx in response.json()['context_used']:
       print(ctx['metadata'].get('image_urls', []))
   ```

### Too many images → oversized chunks

**Solution:** reduce the chunk size so fewer images land in each chunk:

```python
# In multimodal_pdf_parser.py
parser = MultimodalPDFParser(
    chunk_size=300,    # smaller chunks
    chunk_overlap=30,
    extract_images=True
)
```

---

## Conclusion

### When Should You Use Multimodal PDF?

✓ **Use `/upload-pdf-multimodal` when:**
- The PDF contains illustrations (screenshots, diagrams)
- You want the chatbot to reference images in its answers
- You have user guides or tutorials with visual instructions
- The documentation includes charts or tables rendered as images

✓ **Use plain `/upload-pdf` when:**
- The PDF is text-only
- You don't need images in the context
- You have simple documents or FAQs

### The Complete Workflow

1. **Create the PDF** with text + image URLs (Markdown/HTML)
2. **Upload** it via `/upload-pdf-multimodal`
3. **Verify** that the images were detected
4. **Chat** - images are automatically included in the context
5. **Display** the images in your UI

---

## Example: Full Workflow

```python
"""
Complete workflow: upload a multimodal PDF, then chat against it.
"""
import requests

# 1. Upload the multimodal PDF
print("=== Uploading Multimodal PDF ===")
with open('user_guide_with_images.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/upload-pdf-multimodal',
        files={'file': f},
        data={'title': 'User Guide', 'category': 'guide'}
    )

result = response.json()
print(f"✓ Indexed: {result['chunks_indexed']} chunks")
print(f"✓ Message: {result['message']}")

# 2. Chat with multimodal context
print("\n=== Chatting ===")
response = requests.post('http://localhost:8000/chat', json={
    'message': 'Làm sao để tạo event mới? Cho tôi xem hình minh họa.',
    'use_rag': True,
    'use_advanced_rag': True,
    'top_k': 3,
    'hf_token': 'your_token'
})

chat_result = response.json()
print(f"Answer: {chat_result['response']}\n")

# 3. Display the context together with its images
print("=== Context with Images ===")
for i, ctx in enumerate(chat_result['context_used'], 1):
    print(f"\n[{i}] Page {ctx['metadata']['page']}, Confidence: {ctx['confidence']:.2%}")
    print(f"Text: {ctx['metadata']['text'][:150]}...")

    if ctx['metadata'].get('has_images'):
        print(f"Images ({ctx['metadata']['num_images']}):")
        for url in ctx['metadata']['image_urls']:
            print(f"  🖼️ {url}")
```

---

**PDFs with illustrations now work end to end! 🎨📄**
PDF_RAG_GUIDE.md ADDED
# Using PDFs with ChatbotRAG

## Overview

ChatbotRAG now supports **uploading and indexing PDFs** so the chatbot can answer questions from their content. This is useful for:
- Product user guides
- FAQ documents
- Policies and regulations
- Technical documentation

## How It Works

1. **Upload PDF** → the system parses the PDF into text
2. **Chunking** → the text is split into chunks (default: 500 words/chunk, 50-word overlap)
3. **Embedding** → each chunk is converted into a vector embedding
4. **Indexing** → chunks are stored in Qdrant + MongoDB
5. **Chat** → the chatbot retrieves the relevant chunks and answers the question

## Method 1: Upload a PDF via the API

### Endpoint: `POST /upload-pdf`

**Request:**
```bash
curl -X POST "http://localhost:8000/upload-pdf" \
  -F "file=@huong_dan_su_dung.pdf" \
  -F "title=Hướng dẫn sử dụng ChatbotRAG" \
  -F "description=Tài liệu hướng dẫn đầy đủ về ChatbotRAG" \
  -F "category=user_guide"
```

**Python:**
```python
import requests

with open('huong_dan_su_dung.pdf', 'rb') as f:
    files = {'file': f}
    data = {
        'title': 'Hướng dẫn sử dụng ChatbotRAG',
        'description': 'Tài liệu hướng dẫn đầy đủ',
        'category': 'user_guide'
    }

    response = requests.post(
        'http://localhost:8000/upload-pdf',
        files=files,
        data=data
    )

print(response.json())
```

**Response:**
```json
{
  "success": true,
  "document_id": "pdf_20251029_143022",
  "filename": "huong_dan_su_dung.pdf",
  "chunks_indexed": 45,
  "message": "PDF 'huong_dan_su_dung.pdf' đã được index thành công với 45 chunks"
}
```

### Parameters:
- `file` (required): the PDF file
- `document_id` (optional): custom ID; auto-generated by default
- `title` (optional): document title
- `description` (optional): description
- `category` (optional): category (user_guide, faq, policy, etc.)

## Method 2: Batch Index Multiple PDFs

If you have many PDF files, use the batch script:

```bash
# Index every PDF in a directory
python batch_index_pdfs.py ./docs/user_guides

# With a custom category
python batch_index_pdfs.py ./docs/policies --category=policy

# Force reindex (overwrite existing entries)
python batch_index_pdfs.py ./docs/faq --category=faq --force
```

The script automatically:
- Scans the directory for .pdf files
- Indexes each file with appropriate metadata
- Skips files that are already indexed (unless --force is given)
- Prints progress and a summary

## Managing PDF Documents

### List Indexed PDFs

```bash
curl http://localhost:8000/documents/pdf
```

**Response:**
```json
{
  "documents": [
    {
      "document_id": "pdf_user_guide",
      "type": "pdf",
      "filename": "huong_dan_su_dung.pdf",
      "num_chunks": 45,
      "metadata": {
        "title": "Hướng dẫn sử dụng",
        "category": "user_guide"
      }
    }
  ],
  "total": 1
}
```

### Delete a PDF Document

```bash
# Delete the document and all of its chunks
curl -X DELETE http://localhost:8000/documents/pdf/pdf_user_guide
```

## Chatting over PDF Content

Once a PDF is indexed, you can chat as usual:

```python
import requests

response = requests.post('http://localhost:8000/chat', json={
    'message': 'Làm sao để upload PDF vào ChatbotRAG?',
    'use_rag': True,
    'use_advanced_rag': True,
    'top_k': 5,
    'hf_token': 'your_hf_token'
})

result = response.json()
print("Answer:", result['response'])

# Inspect the sources
for ctx in result['context_used']:
    print(f"- Page {ctx['metadata']['page']}: {ctx['metadata']['text'][:100]}...")
```

The chatbot automatically searches the indexed PDF content and answers from it.

## Writing a User Guide PDF

### Suggested Outline

A suggested structure for a ChatbotRAG user-guide PDF:

```
CHATBOTRAG USER GUIDE

1. INTRODUCTION
   - What is ChatbotRAG?
   - Key features
   - Use cases

2. QUICK START
   2.1. Installation
   2.2. Starting the server
   2.3. Accessing the API

3. INDEXING DATA
   3.1. Indexing plain text
   3.2. Indexing with images
   3.3. Indexing multiple texts and images at once
   3.4. Uploading PDFs

4. SEARCH
   4.1. Search by text
   4.2. Search by image
   4.3. Hybrid search

5. CHATTING WITH THE CHATBOT
   5.1. Basic chat
   5.2. Chat with RAG
   5.3. Advanced RAG options
   5.4. Tuning LLM parameters

6. MANAGING DOCUMENTS
   6.1. Listing documents
   6.2. Deleting documents
   6.3. Managing PDF files

7. FREQUENTLY ASKED QUESTIONS (FAQ)
   - How do I upload a PDF?
   - What if the chatbot can't find the information?
   - How can I improve accuracy?
   - What is the token limit?

8. API REFERENCE
   - POST /index
   - POST /search
   - POST /chat
   - POST /upload-pdf
   - GET /documents/pdf
```

206
+ ### Tạo PDF Từ Markdown
207
+
208
+ Bạn có thể tạo PDF từ Markdown bằng nhiều tools:
209
+
210
+ **1. Pandoc (Recommended):**
211
+ ```bash
212
+ pandoc guide.md -o guide.pdf --pdf-engine=xelatex
213
+ ```
214
+
215
+ **2. Online Tools:**
216
+ - https://www.markdowntopdf.com/
217
+ - https://md2pdf.netlify.app/
218
+
219
+ **3. VS Code Extension:**
220
+ - Install "Markdown PDF" extension
221
+ - Right-click file .md → "Markdown PDF: Export (pdf)"
222
+
223
+ ### Ví Dụ Markdown Content
224
+
225
+ Tạo file `chatbot_guide.md`:
226
+
227
+ ```markdown
228
+ # Hướng Dẫn Sử Dụng ChatbotRAG
229
+
230
+ ## 1. Upload PDF
231
+
232
+ Để upload PDF vào hệ thống:
233
+
234
+ ### Bước 1: Chuẩn bị file PDF
235
+ - File phải có định dạng .pdf
236
+ - Nội dung nên rõ ràng, có cấu trúc
237
+
238
+ ### Bước 2: Upload qua API
239
+
240
+ \`\`\`bash
241
+ curl -X POST "http://localhost:8000/upload-pdf" \
242
+ -F "file=@your_file.pdf" \
243
+ -F "title=Tên tài liệu"
244
+ \`\`\`
245
+
246
+ ### Bước 3: Kiểm tra
247
+ Sau khi upload, hệ thống sẽ trả về số chunks đã được index.
248
+
249
+ ## 2. Chat Với Chatbot
250
+
251
+ Sau khi upload PDF, bạn có thể hỏi chatbot:
252
+
253
+ **Ví dụ:**
254
+ - "Làm sao để upload PDF?"
255
+ - "Các bước tạo event là gì?"
256
+ - "Tính năng nào trong hệ thống?"
257
+
258
+ Chatbot sẽ tìm kiếm trong PDF và trả lời dựa trên nội dung đã index.
259
+
260
+ ## 3. FAQ
261
+
262
+ ### Câu hỏi 1: Upload PDF tối đa bao nhiêu trang?
263
+ Không giới hạn, nhưng PDF càng lớn thì thời gian index càng lâu.
264
+
265
+ ### Câu hỏi 2: Có thể upload nhiều PDFs không?
266
+ Có, bạn có thể upload nhiều PDFs. Mỗi PDF sẽ có document_id riêng.
267
+
268
+ ### Câu hỏi 3: Làm sao để xóa PDF đã upload?
269
+ Sử dụng endpoint DELETE /documents/pdf/{document_id}
270
+ ```
271
+
272
+ Then convert it to PDF:
273
+ ```bash
274
+ pandoc chatbot_guide.md -o chatbot_guide.pdf
275
+ ```
276
+
277
+ ## Best Practices
278
+
279
+ ### 1. PDF Structure
+ - ✓ Clear titles
+ - ✓ Split into sections/chapters
+ - ✓ Use bullet points
+ - ✓ Avoid overly complex images (they make text extraction hard)
+
+ ### 2. Content
+ - ✓ Write short, easy-to-understand sentences
+ - ✓ Keep each section focused on one topic
+ - ✓ Include concrete examples
+ - ✗ Avoid long prose that is hard to split into sentences
+
+ ### 3. Metadata
+ - Always set a clear `title`
+ - Use `category` for classification
+ - Add a `description` for easier management
295
+
296
+ ### 4. Chunking
+ Defaults:
+ - Chunk size: 500 words
+ - Overlap: 50 words
+
+ These can be customized in `pdf_parser.py`:
302
+ ```python
303
+ parser = PDFParser(
304
+ chunk_size=500, # Increase for longer context per chunk
+ chunk_overlap=50, # Increase to preserve more context across chunks
+ min_chunk_size=50 # Minimum words per chunk
307
+ )
308
+ ```
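As an illustration of the settings above, here is a minimal sketch of word-based chunking with overlap. `chunk_words` is a hypothetical helper written for this guide, not the actual `pdf_parser.py` implementation:

```python
# Minimal sketch of word-based chunking with overlap.
def chunk_words(text, chunk_size=500, chunk_overlap=50, min_chunk_size=50):
    words = text.split()
    chunks = []
    step = chunk_size - chunk_overlap  # each chunk starts this many words later
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if len(piece) >= min_chunk_size:  # drop trailing fragments that are too small
            chunks.append(" ".join(piece))
    return chunks

# A 1200-word text with the defaults yields chunks starting at words 0, 450, 900
chunks = chunk_words("word " * 1200)
print(len(chunks))  # 3
```

With these numbers, consecutive chunks share 50 words, which is what keeps context intact across chunk boundaries.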
309
+
310
+ ## Troubleshooting
311
+
312
+ ### Error: "Error reading PDF"
+ - Check whether the PDF file is corrupt
+ - Open it in a PDF reader to verify
+ - Re-export the PDF if needed
+
+ ### Error: "No text extracted"
+ - The PDF may consist of scanned images (no text layer)
+ - Run OCR before indexing (with a tool such as Tesseract)
+
+ ### The chatbot cannot find the information
+ - Check `score_threshold` - try lowering it (e.g., 0.3)
+ - Increase `top_k` to retrieve more documents
+ - Rephrase the question
+
+ ### Chunks are too short/long
+ - Adjust `chunk_size` in `pdf_parser.py`
+ - Reindex the PDF with the new settings
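The retrieval fixes only require changing the request payload. A minimal sketch, assuming the `/chat` request fields shown elsewhere in this guide:

```python
# Sketch: a retrieval-friendly /chat payload for when the chatbot misses information.
# Field names follow the examples in this guide; the values are suggestions.
def relaxed_chat_payload(question, hf_token):
    return {
        'message': question,
        'use_rag': True,
        'use_advanced_rag': True,
        'score_threshold': 0.3,  # lowered from the usual 0.5
        'top_k': 10,             # retrieve more candidate documents
        'hf_token': hf_token,
    }

payload = relaxed_chat_payload('Làm sao để tạo event mới?', 'your_token')
print(payload['score_threshold'])  # 0.3
```

Send it as usual with `requests.post('http://localhost:8000/chat', json=payload)`.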
329
+
330
+ ## Complete Example
331
+
332
+ ```python
333
+ # 1. Upload PDF
334
+ import requests
335
+
336
+ with open('user_guide.pdf', 'rb') as f:
337
+ response = requests.post(
338
+ 'http://localhost:8000/upload-pdf',
339
+ files={'file': f},
340
+ data={
341
+ 'title': 'Hướng dẫn sử dụng',
342
+ 'category': 'user_guide'
343
+ }
344
+ )
345
+
346
+ doc_id = response.json()['document_id']
347
+ print(f"Uploaded: {doc_id}")
348
+
349
+ # 2. List PDFs
350
+ response = requests.get('http://localhost:8000/documents/pdf')
351
+ print(response.json())
352
+
353
+ # 3. Chat
354
+ response = requests.post('http://localhost:8000/chat', json={
355
+ 'message': 'Làm sao để tạo event mới?',
356
+ 'use_rag': True,
357
+ 'use_advanced_rag': True,
358
+ 'hf_token': 'your_token'
359
+ })
360
+
361
+ print("Answer:", response.json()['response'])
362
+
363
+ # 4. Delete PDF (if needed)
364
+ response = requests.delete(f'http://localhost:8000/documents/pdf/{doc_id}')
365
+ print(response.json())
366
+ ```
367
+
368
+ ## Next Steps
369
+
370
+ 1. **Create your guide PDF** with content about your system
+ 2. **Upload the PDF** into the system
+ 3. **Test the chatbot** - ask questions about the PDF content
+ 4. **Fine-tune** - adjust parameters if needed
+ 5. **Add more PDFs** - FAQs, policies, etc.
375
+
376
+ ## Support
377
+
378
+ If something goes wrong, check:
+ - The server logs for errors
+ - MongoDB to confirm the documents were stored
+ - The Qdrant collection to verify the chunks were indexed
382
+
383
+ ## Conclusion
384
+
385
+ The PDF RAG system lets your chatbot answer questions from existing documents without retraining the model. You only need to:
+ 1. Upload a PDF
+ 2. Chat as usual
+ 3. The chatbot searches and answers based on the PDF content
+
+ Simple and effective!
QUICK_START_PDF.md ADDED
@@ -0,0 +1,310 @@
1
+ # Quick Start: PDF-Based ChatbotRAG
2
+
3
+ ## Quick Summary
+
+ You can now:
+ 1. **Upload a PDF** user guide into the system
+ 2. **Have the chatbot answer automatically**, based on the PDF content
+ 3. Skip model training entirely - just upload the PDF!
9
+
10
+ ---
11
+
12
+ ## Complete Workflow
+
+ ### Step 1: Create the Guide PDF
+
+ You have two options:
+
+ **Option 1: Use the Bundled Template**
+
+ The file `chatbot_guide_template.md` is ready to use. Customize the content for your system, then convert it to PDF:
21
+
22
+ ```bash
23
+ # Install pandoc (if you don't have it yet)
24
+ # Windows: choco install pandoc
25
+ # Mac: brew install pandoc
26
+ # Linux: sudo apt-get install pandoc
27
+
28
+ # Convert markdown to PDF
29
+ pandoc chatbot_guide_template.md -o chatbot_user_guide.pdf --pdf-engine=xelatex
30
+ ```
31
+
32
+ **Option 2: Write the Content Yourself**
+
+ Create a Word/Google Docs file with the guide content, then:
+ - File → Export → PDF
+
+ **The content should include:**
+ - A system introduction
+ - The main features
+ - Usage instructions for each feature
+ - FAQ (frequently asked questions)
+ - Examples
43
+
44
+ ### Step 2: Upload the PDF into the System
45
+
46
+ ```bash
47
+ # Start the server
48
+ cd ChatbotRAG
49
+ python main.py
50
+ ```
51
+
52
+ In another terminal:
53
+
54
+ ```bash
55
+ # Upload PDF
56
+ curl -X POST "http://localhost:8000/upload-pdf" \
57
+ -F "file=@chatbot_user_guide.pdf" \
58
+ -F "title=Hướng dẫn sử dụng ChatbotRAG" \
59
+ -F "description=Tài liệu hướng dẫn đầy đủ" \
60
+ -F "category=user_guide"
61
+ ```
62
+
63
+ Or with Python:
64
+
65
+ ```python
66
+ import requests
67
+
68
+ with open('chatbot_user_guide.pdf', 'rb') as f:
69
+ response = requests.post(
70
+ 'http://localhost:8000/upload-pdf',
71
+ files={'file': f},
72
+ data={
73
+ 'title': 'Hướng dẫn sử dụng ChatbotRAG',
74
+ 'category': 'user_guide'
75
+ }
76
+ )
77
+
78
+ print(response.json())
79
+ # Output: {"success": true, "document_id": "pdf_...", "chunks_indexed": 45}
80
+ ```
81
+
82
+ ### Step 3: Verify the Upload
83
+
84
+ ```bash
85
+ # List the uploaded PDFs
86
+ curl http://localhost:8000/documents/pdf
87
+ ```
88
+
89
+ ### Step 4: Chat!
90
+
91
+ ```python
92
+ import requests
93
+
94
+ response = requests.post('http://localhost:8000/chat', json={
95
+ 'message': 'Làm sao để upload PDF vào ChatbotRAG?',
96
+ 'use_rag': True,
97
+ 'use_advanced_rag': True,
98
+ 'top_k': 5,
99
+ 'hf_token': 'your_huggingface_token' # Get from https://huggingface.co/settings/tokens
100
+ })
101
+
102
+ result = response.json()
103
+ print("Answer:", result['response'])
104
+ print("\nSources:")
105
+ for ctx in result['context_used']:
106
+ print(f"- Page {ctx['metadata']['page']}: Confidence {ctx['confidence']:.2%}")
107
+ ```
108
+
109
+ ---
110
+
111
+ ## Sample Test Script
112
+
113
+ File `test_pdf_chatbot.py`:
114
+
115
+ ```python
116
+ """
117
+ Test PDF-based chatbot
118
+ """
119
+ import requests
120
+ import time
121
+
122
+ BASE_URL = "http://localhost:8000"
123
+ HF_TOKEN = "your_huggingface_token" # Replace with your token
124
+
125
+ def upload_pdf():
126
+ """Upload PDF guide"""
127
+ print("=== Uploading PDF ===")
128
+
129
+ with open('chatbot_user_guide.pdf', 'rb') as f:
130
+ response = requests.post(
131
+ f'{BASE_URL}/upload-pdf',
132
+ files={'file': f},
133
+ data={
134
+ 'title': 'ChatbotRAG User Guide',
135
+ 'category': 'user_guide'
136
+ }
137
+ )
138
+
139
+ result = response.json()
140
+ print(f"✓ Uploaded: {result['chunks_indexed']} chunks")
141
+ return result['document_id']
142
+
143
+ def chat(question):
144
+ """Ask chatbot"""
145
+ print(f"\n=== Question: {question} ===")
146
+
147
+ response = requests.post(f'{BASE_URL}/chat', json={
148
+ 'message': question,
149
+ 'use_rag': True,
150
+ 'use_advanced_rag': True,
151
+ 'top_k': 5,
152
+ 'hf_token': HF_TOKEN
153
+ })
154
+
155
+ result = response.json()
156
+ print(f"Answer: {result['response']}\n")
157
+
158
+ print(f"Retrieved {len(result['context_used'])} documents:")
159
+ for i, ctx in enumerate(result['context_used'], 1):
160
+ print(f"{i}. Page {ctx['metadata'].get('page')}, Confidence: {ctx['confidence']:.2%}")
161
+
162
+ def main():
163
+ # 1. Upload PDF
164
+ doc_id = upload_pdf()
165
+
166
+ # Wait for indexing to complete
167
+ time.sleep(2)
168
+
169
+ # 2. Test questions
170
+ questions = [
171
+ "Làm sao để upload PDF vào hệ thống?",
172
+ "Chatbot có support tiếng Việt không?",
173
+ "Tối đa bao nhiêu texts có thể index cùng lúc?",
174
+ "Advanced RAG có những tính năng gì?"
175
+ ]
176
+
177
+ for q in questions:
178
+ chat(q)
179
+ time.sleep(1)
180
+
181
+ if __name__ == "__main__":
182
+ main()
183
+ ```
184
+
185
+ Chạy:
186
+ ```bash
187
+ python test_pdf_chatbot.py
188
+ ```
189
+
190
+ ---
191
+
192
+ ## Uploading Multiple PDFs at Once
+
+ If you have several PDFs (FAQ, user guide, policies, etc.):
195
+
196
+ ```bash
197
+ # Put all the PDFs in one directory
+ mkdir docs
+ # Copy the PDFs into docs/
200
+
201
+ # Batch index
202
+ python batch_index_pdfs.py ./docs --category=user_guide
203
+ ```
204
+
205
+ The script automatically indexes every PDF and skips files that are already indexed.
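The skip logic can be sketched as follows. `pdfs_to_index` is a hypothetical helper written for this guide; the actual `batch_index_pdfs.py` may differ:

```python
from pathlib import Path

# Sketch: decide which PDFs in a folder still need indexing,
# given the set of document titles already in the system.
def pdfs_to_index(folder, already_indexed):
    return [p for p in sorted(Path(folder).glob("*.pdf"))
            if p.stem not in already_indexed]
```

Each remaining file would then be POSTed to `/upload-pdf` exactly as in the single-file example above.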
206
+
207
+ ---
208
+
209
+ ## Sample Test Questions
+
+ After uploading the guide PDF, test with questions like:
212
+
213
+ **About features:**
214
+ - "ChatbotRAG có những tính năng gì?"
215
+ - "Làm sao để index dữ liệu?"
216
+ - "Advanced RAG là gì?"
217
+
218
+ **Usage instructions:**
219
+ - "Làm sao để upload PDF?"
220
+ - "Cách chat với chatbot như thế nào?"
221
+ - "Làm sao để xem lịch sử chat?"
222
+
223
+ **FAQ:**
224
+ - "Chatbot không tìm thấy thông tin phải làm sao?"
225
+ - "Tối đa bao nhiêu images có thể upload?"
226
+ - "Token limit là bao nhiêu?"
227
+
228
+ **Technical:**
229
+ - "Score threshold là gì?"
230
+ - "Top_k trong chat request có ý nghĩa gì?"
231
+ - "Làm sao để cải thiện độ chính xác?"
232
+
233
+ ---
234
+
235
+ ## Tips for Better Chatbot Answers
+
+ ### 1. PDF Content Quality
+ - Write clearly and with structure
+ - Keep each section focused on one topic
+ - Include concrete examples
+ - Build the FAQ from real questions
242
+
243
+ ### 2. Chat Settings
244
+ ```python
245
+ {
246
+ 'use_advanced_rag': True, # Always enable
+ 'use_reranking': True, # Rerank for accuracy
+ 'use_compression': True, # Compress context
+ 'score_threshold': 0.5, # 0.4-0.6 works well
+ 'top_k': 5, # 3-7 depending on use case
+ 'temperature': 0.3 # Low for factual answers
252
+ }
253
+ ```
254
+
255
+ ### 3. Query Tips
+ - Ask clear, specific questions
+ - Avoid overly generic questions
+ - If nothing is found, rephrase the question
259
+
260
+ ---
261
+
262
+ ## Monitoring
263
+
264
+ ### Check Index Status
265
+ ```bash
266
+ curl http://localhost:8000/stats
267
+ ```
268
+
269
+ ### View PDFs
270
+ ```bash
271
+ curl http://localhost:8000/documents/pdf
272
+ ```
273
+
274
+ ### Check Chat History
275
+ ```bash
276
+ curl "http://localhost:8000/history?limit=10"
277
+ ```
278
+
279
+ ---
280
+
281
+ ## Conclusion
+
+ You can now:
+
+ ✓ Create a guide PDF with your own content
+ ✓ Upload the PDF into the system in seconds
+ ✓ Let the chatbot answer automatically from the PDF content
+ ✓ No training, no complicated code
+ ✓ Need to update the content? Just upload a new PDF!
+
+ **Next Steps:**
+ 1. Create your guide PDF (or customize the template)
+ 2. Upload it into the system
+ 3. Test with real questions
+ 4. Fine-tune the settings if needed
+ 5. Add more PDFs (FAQ, policies, etc.)
297
+
298
+ ---
299
+
300
+ ## Key Files
+
+ - `pdf_parser.py` - PDF parsing engine
+ - `batch_index_pdfs.py` - Batch indexing script
+ - `chatbot_guide_template.md` - Template PDF content
+ - `PDF_RAG_GUIDE.md` - Detailed PDF RAG guide
+ - `ADVANCED_RAG_GUIDE.md` - Advanced RAG features
307
+
308
+ ---
309
+
310
+ **Good luck! 🚀**
SUMMARY.md ADDED
@@ -0,0 +1,429 @@
1
+ # ChatbotRAG - Complete Summary
2
+
3
+ ## System Overview
+
+ The ChatbotRAG system has been comprehensively upgraded with advanced features:
6
+
7
+ ### ✨ Tính Năng Chính
8
+
9
+ 1. **Multiple Inputs Support** (/index)
+ - Index up to 10 texts + 10 images at once
+ - Embeddings are averaged automatically
+
+ 2. **Advanced RAG Pipeline** (/chat)
+ - Query Expansion
+ - Multi-Query Retrieval
+ - Reranking with semantic similarity
+ - Contextual Compression
+ - Better Prompt Engineering
+
+ 3. **PDF Support** (/upload-pdf)
+ - Parse PDFs into chunks
+ - Automatic chunking with overlap
+ - Index into the RAG system
+
+ 4. **Multimodal PDF** (/upload-pdf-multimodal) ⭐ NEW
+ - Extract text + image URLs from the PDF
+ - Link images to text chunks
+ - Return images alongside text in chat
+ - Perfect for user guides with screenshots
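The averaged embeddings mentioned in point 1 can be sketched as follows. This is assumed behavior for illustration; the actual `/index` implementation may differ:

```python
import numpy as np

# Sketch: combine several text/image embeddings into one document vector
# by averaging, then L2-normalizing the result.
def average_embeddings(vectors):
    mean = np.mean(np.stack(vectors), axis=0)
    return mean / (np.linalg.norm(mean) + 1e-8)

# Two orthogonal unit vectors average to a vector pointing between them
v = average_embeddings([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
print(v)
```

Normalizing after averaging keeps cosine-similarity search well behaved regardless of how many inputs were combined.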
30
+
31
+ ---
32
+
33
+ ## System Architecture
34
+
35
+ ```
36
+ ┌─────────────────────────────────────────────────────────────┐
37
+ │ FastAPI Application │
38
+ ├─────────────────────────────────────────────────────────────┤
39
+ │ │
40
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
41
+ │ │ Indexing │ │ Search │ │ Chat │ │
42
+ │ │ Endpoints │ │ Endpoints │ │ Endpoint │ │
43
+ │ └──────────────┘ └──────────────┘ └──────────────┘ │
44
+ │ │
45
+ ├─────────────────────────────────────────────────────────────┤
46
+ │ │
47
+ │ ┌──────────────────────────────────────────────────────┐ │
48
+ │ │ Advanced RAG Pipeline │ │
49
+ │ │ • Query Expansion │ │
50
+ │ │ • Multi-Query Retrieval │ │
51
+ │ │ • Reranking │ │
52
+ │ │ • Contextual Compression │ │
53
+ │ └──────────────────────────────────────────────────────┘ │
54
+ │ │
55
+ ├─────────────────────────────────────────────────────────────┤
56
+ │ │
57
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
58
+ │ │ Jina CLIP │ │ Qdrant │ │ MongoDB │ │
59
+ │ │ v2 │ │ Vector DB │ │ Documents │ │
60
+ │ └──────────────┘ └──────────────┘ └──────────────┘ │
61
+ │ │
62
+ │ ┌──────────────┐ ┌──────────────┐ │
63
+ │ │ PDF │ │ Multimodal │ │
64
+ │ │ Parser │ │ PDF Parser │ │
65
+ │ └──────────────┘ └──────────────┘ │
66
+ │ │
67
+ └─────────────────────────────────────────────────────────────┘
68
+ ```
69
+
70
+ ---
71
+
72
+ ## Key Files
73
+
74
+ ### Core System
75
+ - **main.py** - FastAPI application with all endpoints
+ - **embedding_service.py** - Jina CLIP v2 embeddings
+ - **qdrant_service.py** - Qdrant vector DB operations
+ - **advanced_rag.py** - Advanced RAG pipeline
79
+
80
+ ### PDF Processing
81
+ - **pdf_parser.py** - Basic PDF parser (text only)
82
+ - **multimodal_pdf_parser.py** - Multimodal PDF parser (text + images)
83
+ - **batch_index_pdfs.py** - Batch indexing script
84
+
85
+ ### Documentation
86
+ - **ADVANCED_RAG_GUIDE.md** - Advanced RAG features guide
87
+ - **PDF_RAG_GUIDE.md** - PDF usage guide
88
+ - **MULTIMODAL_PDF_GUIDE.md** - Multimodal PDF guide ⭐
89
+ - **QUICK_START_PDF.md** - Quick start for PDF
90
+ - **chatbot_guide_template.md** - Template for user guide PDF
91
+
92
+ ### Testing
93
+ - **test_advanced_features.py** - Test advanced features
94
+ - **test_pdf_chatbot.py** - Test PDF chatbot (example in docs)
95
+
96
+ ---
97
+
98
+ ## API Endpoints
99
+
100
+ ### 1. Indexing
101
+
102
+ | Endpoint | Method | Description |
103
+ |----------|--------|-------------|
104
+ | `/index` | POST | Index texts + images (max 10 each) |
105
+ | `/documents` | POST | Add text document |
106
+ | `/upload-pdf` | POST | Upload PDF (text only) |
107
+ | `/upload-pdf-multimodal` | POST | Upload PDF with images ⭐ |
108
+
109
+ ### 2. Search
110
+
111
+ | Endpoint | Method | Description |
112
+ |----------|--------|-------------|
113
+ | `/search` | POST | Hybrid search (text + image) |
114
+ | `/search/text` | POST | Text-only search |
115
+ | `/search/image` | POST | Image-only search |
116
+ | `/rag/search` | POST | RAG knowledge base search |
117
+
118
+ ### 3. Chat
119
+
120
+ | Endpoint | Method | Description |
121
+ |----------|--------|-------------|
122
+ | `/chat` | POST | Chat with Advanced RAG |
123
+
124
+ ### 4. Management
125
+
126
+ | Endpoint | Method | Description |
127
+ |----------|--------|-------------|
128
+ | `/documents/pdf` | GET | List all PDFs |
129
+ | `/documents/pdf/{id}` | DELETE | Delete PDF document |
130
+ | `/delete/{doc_id}` | DELETE | Delete document |
131
+ | `/document/{doc_id}` | GET | Get document by ID |
132
+ | `/history` | GET | Get chat history |
133
+ | `/stats` | GET | Collection statistics |
134
+ | `/` | GET | Health check + API docs |
135
+
136
+ ---
137
+
138
+ ## Use Cases & Recommendations
139
+
140
+ ### Case 1: Text-Only Guide PDF
141
+
142
+ **Scenario:** FAQ, policy document, text guide
143
+
144
+ **Solution:** `/upload-pdf`
145
+
146
+ ```bash
147
+ curl -X POST "http://localhost:8000/upload-pdf" \
148
149
+ -F "title=FAQ"
150
+ ```
151
+
152
+ ### Case 2: Guide PDF with Images ⭐ (Your Case)
+
+ **Scenario:** A user guide with screenshots, or a tutorial with diagrams
155
+
156
+ **Solution:** `/upload-pdf-multimodal`
157
+
158
+ ```bash
159
+ curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
160
+ -F "file=@user_guide_with_images.pdf" \
161
+ -F "title=User Guide" \
162
+ -F "category=guide"
163
+ ```
164
+
165
+ **Benefits:**
166
+ - ✓ Extract text + image URLs
167
+ - ✓ Link images to text chunks
+ - ✓ The chatbot returns images in its response
169
+ - ✓ Visual context for users
170
+
171
+ ### Case 3: Multiple Social Media Posts
172
+
173
+ **Scenario:** Index many posts with texts and images
174
+
175
+ **Solution:** `/index` with multiple inputs
176
+
177
+ ```python
178
+ data = {
179
+ 'id': 'post123',
180
+ 'texts': ['Post text 1', 'Post text 2', ...], # Max 10
181
+ }
182
+ files = [
183
+ ('images', open('img1.jpg', 'rb')),
184
+ ('images', open('img2.jpg', 'rb')), # Max 10
185
+ ]
186
+ requests.post('http://localhost:8000/index', data=data, files=files)
187
+ ```
188
+
189
+ ### Case 4: Complex Queries
190
+
191
+ **Scenario:** Complex questions that require high accuracy
192
+
193
+ **Solution:** Advanced RAG with full options
194
+
195
+ ```python
196
+ {
197
+ 'message': 'Complex question',
198
+ 'use_rag': True,
199
+ 'use_advanced_rag': True,
200
+ 'use_reranking': True,
201
+ 'use_compression': True,
202
+ 'score_threshold': 0.5,
203
+ 'top_k': 5
204
+ }
205
+ ```
206
+
207
+ ---
208
+
209
+ ## Suggested Workflow
+
+ ### Initial Setup
+
+ 1. **Create the user guide PDF**
+ - Use the template: `chatbot_guide_template.md`
+ - Customize the content for your system
+ - Add image URLs (screenshots, diagrams)
+ - Convert to PDF: `pandoc template.md -o guide.pdf`
218
+
219
+ 2. **Upload PDF**
220
+ ```bash
221
+ curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
222
+ -F "file=@chatbot_user_guide.pdf" \
223
+ -F "title=Hướng dẫn sử dụng ChatbotRAG" \
224
+ -F "category=user_guide"
225
+ ```
226
+
227
+ 3. **Verify**
228
+ ```bash
229
+ curl http://localhost:8000/documents/pdf
230
+ # Check "type": "multimodal_pdf" and "total_images"
231
+ ```
232
+
233
+ ### Daily Usage
+
+ 1. **Chat with the user**
236
+ ```python
237
+ response = requests.post('http://localhost:8000/chat', json={
238
+ 'message': user_question,
239
+ 'use_rag': True,
240
+ 'use_advanced_rag': True,
241
+ 'hf_token': 'your_token'
242
+ })
243
+ ```
244
+
245
+ 2. **Display response + images**
246
+ ```python
247
+ # Text answer
248
+ print(response.json()['response'])
249
+
250
+ # Images (if any)
251
+ for ctx in response.json()['context_used']:
252
+ if ctx['metadata'].get('has_images'):
253
+ for url in ctx['metadata']['image_urls']:
254
+ # Display image in your UI
255
+ print(f"Image: {url}")
256
+ ```
257
+
258
+ ### Updating Content
+
+ 1. **Update the PDF** - edit and re-export
+ 2. **Delete the old PDF**
262
+ ```bash
263
+ curl -X DELETE http://localhost:8000/documents/pdf/old_doc_id
264
+ ```
265
+ 3. **Upload the new PDF**
266
+ ```bash
267
+ curl -X POST http://localhost:8000/upload-pdf-multimodal -F "file=@new_guide.pdf"
268
+ ```
269
+
270
+ ---
271
+
272
+ ## Performance Tips
273
+
274
+ ### 1. Chunking
275
+
276
+ **Default:**
277
+ - chunk_size: 500 words
278
+ - chunk_overlap: 50 words
279
+
280
+ **Tuning:**
281
+ ```python
282
+ # In multimodal_pdf_parser.py
283
+ parser = MultimodalPDFParser(
284
+ chunk_size=400, # Shorter for faster retrieval
285
+ chunk_overlap=40,
286
+ min_chunk_size=50
287
+ )
288
+ ```
289
+
290
+ ### 2. Retrieval
291
+
292
+ **Good settings:**
293
+ ```python
294
+ {
295
+ 'top_k': 5, # 3-7 is optimal
296
+ 'score_threshold': 0.5, # 0.4-0.6 is good
297
+ 'use_reranking': True, # Always enable
298
+ 'use_compression': True # Keeps context relevant
299
+ }
300
+ ```
301
+
302
+ ### 3. LLM
303
+
304
+ **For factual answers:**
305
+ ```python
306
+ {
307
+ 'temperature': 0.3, # Low for accuracy
308
+ 'max_tokens': 512, # Concise answers
309
+ 'top_p': 0.9
310
+ }
311
+ ```
312
+
313
+ ---
314
+
315
+ ## Troubleshooting
316
+
317
+ ### Issue 1: Images are not detected
+
+ **Solution:**
+ - Verify the PDF contains image URLs (http://, https://)
+ - Check the format: markdown `![](url)` or HTML `<img src>`
+ - Test the regex:
323
+ ```python
324
+ from multimodal_pdf_parser import MultimodalPDFParser
325
+ parser = MultimodalPDFParser()
326
+ urls = parser.extract_image_urls("![](https://example.com/img.png)")
327
+ print(urls) # Should return ['https://example.com/img.png']
328
+ ```
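A regex of the kind `extract_image_urls` is described to use can be sketched like this. The pattern below is an assumption for illustration; the real pattern in `multimodal_pdf_parser.py` may differ:

```python
import re

# Sketch: pull http(s) image URLs out of markdown ![](...) and HTML <img src="...">
IMG_RE = re.compile(r'!\[[^\]]*\]\((https?://[^)\s]+)\)|<img[^>]+src="(https?://[^"]+)"')

def extract_image_urls(text):
    # findall returns (markdown_url, html_url) tuples; exactly one side is non-empty
    return [a or b for a, b in IMG_RE.findall(text)]

print(extract_image_urls('![](https://example.com/img.png) <img src="https://x.com/a.jpg">'))
# ['https://example.com/img.png', 'https://x.com/a.jpg']
```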
329
+
330
+ ### Issue 2: The chatbot cannot find the information
331
+
332
+ **Solution:**
333
+ - Lower score_threshold: `0.3-0.5`
334
+ - Increase top_k: `5-10`
335
+ - Enable Advanced RAG
336
+ - Rephrase question
337
+
338
+ ### Issue 3: Responses are too slow
+
+ **Solution:**
+ - Reduce top_k
+ - Disable compression if it is not needed
+ - Use basic RAG instead of advanced RAG for simple queries
344
+
345
+ ---
346
+
347
+ ## Next Steps
348
+
349
+ ### Immediate (Now)
+
+ 1. ✓ The system is ready!
+ 2. Create your guide PDF
+ 3. Upload it via `/upload-pdf-multimodal`
+ 4. Test with real questions
+
+ ### Short Term (1-2 weeks)
+
+ 1. Collect user feedback
+ 2. Fine-tune parameters (top_k, threshold)
+ 3. Add more PDFs (FAQ, tutorials, etc.)
+ 4. Monitor the chat history to improve content
362
+
363
+ ### Long Term (Later)
364
+
365
+ 1. **Hybrid Search với BM25**
366
+ - Combine dense + sparse retrieval
367
+ - Better for keyword queries
368
+
369
+ 2. **Cross-Encoder Reranking**
370
+ - Replace embedding similarity
371
+ - More accurate ranking
372
+
373
+ 3. **Image Processing**
374
+ - Download và process actual images
375
+ - Use Jina CLIP for image embeddings
376
+ - True multimodal embeddings (text + image vectors)
377
+
378
+ 4. **RAG-Anything Integration** (if needed)
379
+ - For complex PDFs with tables, charts
380
+ - Vision encoder for embedded images
381
+ - Advanced document understanding
382
+
383
+ ---
384
+
385
+ ## Comparison Matrix
386
+
387
+ | Approach | Text | Images | URLs | Complexity | Your Case |
388
+ |----------|------|--------|------|------------|-----------|
389
+ | Basic RAG | ✓ | ✗ | ✗ | Low | ✗ |
390
+ | PDF Parser | ✓ | ✗ | ✗ | Low | ✗ |
391
+ | **Multimodal PDF** | ✓ | ✗ | ✓ | **Medium** | **✓** |
392
+ | RAG-Anything | ✓ | ✓ | ✓ | High | Overkill |
393
+
394
+ **Recommendation:** **Multimodal PDF** is a perfect fit for your case!
395
+
396
+ ---
397
+
398
+ ## Conclusion
+
+ ### What You Have
+
+ ✅ **Multiple Inputs**: Index 10 texts + 10 images
+ ✅ **Advanced RAG**: Query expansion, reranking, compression
+ ✅ **PDF Support**: Parse and index PDFs
+ ✅ **Multimodal PDF**: Extract text + image URLs and link them together
+ ✅ **Complete Documentation**: Guides, examples, troubleshooting
+
+ ### What's Next?
+
+ 1. **Create a guide PDF** with your own content (including image URLs)
+ 2. **Upload** it via `/upload-pdf-multimodal`
+ 3. **Test** with real questions
+ 4. **Iterate** - fine-tune based on feedback
414
+
415
+ ### Files to Read
+
+ **For PDFs with images (your case):**
418
+ - [MULTIMODAL_PDF_GUIDE.md](MULTIMODAL_PDF_GUIDE.md) ⭐⭐⭐
419
+ - [PDF_RAG_GUIDE.md](PDF_RAG_GUIDE.md)
420
+
421
+ **For Advanced RAG:**
422
+ - [ADVANCED_RAG_GUIDE.md](ADVANCED_RAG_GUIDE.md)
423
+
424
+ **Quick Start:**
425
+ - [QUICK_START_PDF.md](QUICK_START_PDF.md)
426
+
427
+ ---
428
+
429
+ **Your system is now very powerful! Just upload a PDF and start chatting! 🚀📄🤖**
advanced_rag.py ADDED
@@ -0,0 +1,301 @@
1
+ """
2
+ Advanced RAG techniques for improved retrieval and generation
3
+ Includes: Query Expansion, Reranking, Contextual Compression, Hybrid Search
4
+ """
5
+
6
+ from typing import List, Dict, Optional, Tuple
7
+ import numpy as np
8
+ from dataclasses import dataclass
9
+ import re
10
+
11
+
12
+ @dataclass
13
+ class RetrievedDocument:
14
+ """Document retrieved from vector database"""
15
+ id: str
16
+ text: str
17
+ confidence: float
18
+ metadata: Dict
19
+
20
+
21
+ class AdvancedRAG:
22
+ """Advanced RAG system with modern techniques"""
23
+
24
+ def __init__(self, embedding_service, qdrant_service):
25
+ self.embedding_service = embedding_service
26
+ self.qdrant_service = qdrant_service
27
+
28
+ def expand_query(self, query: str) -> List[str]:
29
+ """
30
+ Expand query with related terms and variations
31
+ Simple rule-based expansion for Vietnamese queries
32
+ """
33
+ queries = [query]
34
+
35
+ # Add query variations
36
+ # Remove question words for alternative search
37
+ question_words = ['ai', 'gì', 'nào', 'đâu', 'khi nào', 'như thế nào',
38
+ 'tại sao', 'có', 'là', 'được', 'không']
39
+
40
+ query_lower = query.lower()
41
+ for qw in question_words:
42
+ if qw in query_lower:
43
+ variant = query_lower.replace(qw, '').strip()
44
+ if variant and variant != query_lower:
45
+ queries.append(variant)
46
+
47
+ # Extract key nouns/phrases (simple approach)
48
+ words = query.split()
49
+ if len(words) > 3:
50
+ # Take important words (skip first question word)
51
+ key_phrases = ' '.join(words[1:]) if words[0].lower() in question_words else ' '.join(words[:3])
52
+ if key_phrases not in queries:
53
+ queries.append(key_phrases)
54
+
55
+ return queries[:3] # Return top 3 variations
56
+
57
+ def multi_query_retrieval(
58
+ self,
59
+ query: str,
60
+ top_k: int = 5,
61
+ score_threshold: float = 0.5
62
+ ) -> List[RetrievedDocument]:
63
+ """
64
+ Retrieve documents using multiple query variations
65
+ Combines results from all query variations
66
+ """
67
+ expanded_queries = self.expand_query(query)
68
+
69
+ all_results = {} # Use dict to deduplicate by doc_id
70
+
71
+ for q in expanded_queries:
72
+ # Generate embedding for each query variant
73
+ query_embedding = self.embedding_service.encode_text(q)
74
+
75
+ # Search in Qdrant
76
+ results = self.qdrant_service.search(
77
+ query_embedding=query_embedding,
78
+ limit=top_k,
79
+ score_threshold=score_threshold
80
+ )
81
+
82
+ # Add to results (keep highest score for duplicates)
83
+ for result in results:
84
+ doc_id = result["id"]
85
+ if doc_id not in all_results or result["confidence"] > all_results[doc_id].confidence:
86
+ all_results[doc_id] = RetrievedDocument(
87
+ id=doc_id,
88
+ text=result["metadata"].get("text", ""),
89
+ confidence=result["confidence"],
90
+ metadata=result["metadata"]
91
+ )
92
+
93
+ # Sort by confidence and return top_k
94
+ sorted_results = sorted(all_results.values(), key=lambda x: x.confidence, reverse=True)
95
+ return sorted_results[:top_k]
96
+
97
+ def rerank_documents(
98
+ self,
99
+ query: str,
100
+ documents: List[RetrievedDocument],
101
+ use_cross_encoder: bool = False
102
+ ) -> List[RetrievedDocument]:
103
+ """
104
+ Rerank documents based on semantic similarity
105
+ Simple reranking using embedding similarity (can be upgraded to cross-encoder)
106
+ """
107
+ if not documents:
108
+ return documents
109
+
110
+ # Simple reranking: recalculate similarity with original query
111
+ query_embedding = self.embedding_service.encode_text(query)
112
+
113
+ reranked = []
114
+ for doc in documents:
115
+ # Get document embedding
116
+ doc_embedding = self.embedding_service.encode_text(doc.text)
117
+
118
+ # Calculate cosine similarity (explicitly normalized, so the score is
+ # correct even if the embeddings are not unit-length)
+ q_vec = query_embedding.flatten()
+ d_vec = doc_embedding.flatten()
+ similarity = np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec) + 1e-8)
120
+
+            # Combine with original confidence (weighted average)
+            new_score = 0.6 * similarity + 0.4 * doc.confidence
+
+            reranked.append(RetrievedDocument(
+                id=doc.id,
+                text=doc.text,
+                confidence=float(new_score),
+                metadata=doc.metadata
+            ))
+
+        # Sort by new score
+        reranked.sort(key=lambda x: x.confidence, reverse=True)
+        return reranked
+
+    def compress_context(
+        self,
+        query: str,
+        documents: List[RetrievedDocument],
+        max_tokens: int = 500
+    ) -> List[RetrievedDocument]:
+        """
+        Compress context to most relevant parts
+        Remove redundant information and keep only relevant sentences
+        """
+        compressed_docs = []
+
+        for doc in documents:
+            # Split into sentences
+            sentences = self._split_sentences(doc.text)
+
+            # Score each sentence based on relevance to query
+            scored_sentences = []
+            query_words = set(query.lower().split())
+
+            for sent in sentences:
+                sent_words = set(sent.lower().split())
+                # Simple relevance: word overlap
+                overlap = len(query_words & sent_words)
+                if overlap > 0:
+                    scored_sentences.append((sent, overlap))
+
+            # Sort by relevance and take top sentences
+            scored_sentences.sort(key=lambda x: x[1], reverse=True)
+
+            # Reconstruct compressed text (up to max_tokens)
+            compressed_text = ""
+            word_count = 0
+            for sent, score in scored_sentences:
+                sent_words = len(sent.split())
+                if word_count + sent_words <= max_tokens:
+                    compressed_text += sent + " "
+                    word_count += sent_words
+                else:
+                    break
+
+            # If nothing selected, take original first part
+            if not compressed_text.strip():
+                compressed_text = doc.text[:max_tokens * 5]  # Rough estimate
+
+            compressed_docs.append(RetrievedDocument(
+                id=doc.id,
+                text=compressed_text.strip(),
+                confidence=doc.confidence,
+                metadata=doc.metadata
+            ))
+
+        return compressed_docs
+
+    def _split_sentences(self, text: str) -> List[str]:
+        """Split text into sentences (Vietnamese-aware)"""
+        # Simple sentence splitter
+        sentences = re.split(r'[.!?]+', text)
+        return [s.strip() for s in sentences if s.strip()]
+
+    def hybrid_rag_pipeline(
+        self,
+        query: str,
+        top_k: int = 5,
+        score_threshold: float = 0.5,
+        use_reranking: bool = True,
+        use_compression: bool = True,
+        max_context_tokens: int = 500
+    ) -> Tuple[List[RetrievedDocument], Dict]:
+        """
+        Complete advanced RAG pipeline
+        1. Multi-query retrieval
+        2. Reranking
+        3. Contextual compression
+        """
+        stats = {
+            "original_query": query,
+            "expanded_queries": [],
+            "initial_results": 0,
+            "after_rerank": 0,
+            "after_compression": 0
+        }
+
+        # Step 1: Multi-query retrieval
+        expanded_queries = self.expand_query(query)
+        stats["expanded_queries"] = expanded_queries
+
+        documents = self.multi_query_retrieval(
+            query=query,
+            top_k=top_k * 2,  # Get more candidates for reranking
+            score_threshold=score_threshold
+        )
+        stats["initial_results"] = len(documents)
+
+        # Step 2: Reranking (optional)
+        if use_reranking and documents:
+            documents = self.rerank_documents(query, documents)
+            documents = documents[:top_k]  # Keep top_k after reranking
+            stats["after_rerank"] = len(documents)
+
+        # Step 3: Contextual compression (optional)
+        if use_compression and documents:
+            documents = self.compress_context(
+                query=query,
+                documents=documents,
+                max_tokens=max_context_tokens
+            )
+            stats["after_compression"] = len(documents)
+
+        return documents, stats
+
+    def format_context_for_llm(
+        self,
+        documents: List[RetrievedDocument],
+        include_metadata: bool = True
+    ) -> str:
+        """
+        Format retrieved documents into context string for LLM
+        Uses better structure for improved LLM understanding
+        """
+        if not documents:
+            return ""
+
+        context_parts = ["RELEVANT CONTEXT:\n"]
+
+        for i, doc in enumerate(documents, 1):
+            context_parts.append(f"\n--- Document {i} (Relevance: {doc.confidence:.2%}) ---")
+            context_parts.append(doc.text)
+
+            if include_metadata and doc.metadata:
+                # Add useful metadata
+                meta_str = []
+                for key, value in doc.metadata.items():
+                    if key not in ['text', 'texts'] and value:
+                        meta_str.append(f"{key}: {value}")
+                if meta_str:
+                    context_parts.append(f"[Metadata: {', '.join(meta_str)}]")
+
+        context_parts.append("\n--- End of Context ---\n")
+        return "\n".join(context_parts)
+
+    def build_rag_prompt(
+        self,
+        query: str,
+        context: str,
+        system_message: str = "You are a helpful AI assistant."
+    ) -> str:
+        """
+        Build optimized RAG prompt for LLM
+        Uses best practices for prompt engineering
+        """
+        prompt_template = f"""{system_message}
+
+{context}
+
+INSTRUCTIONS:
+1. Answer the user's question using ONLY the information provided in the context above
+2. If the context doesn't contain relevant information, say "Tôi không tìm thấy thông tin liên quan trong dữ liệu."
+3. Cite relevant parts of the context when answering
+4. Be concise and accurate
+5. Answer in Vietnamese if the question is in Vietnamese
+
+USER QUESTION: {query}
+
+YOUR ANSWER:"""
+
+        return prompt_template
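The word-overlap compression step above can be exercised in isolation. A minimal sketch with hypothetical standalone names (`split_sentences`, `compress`) mirroring `_split_sentences` and the sentence-scoring loop in `compress_context`:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter, same regex as _split_sentences above
    return [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]

def compress(query: str, text: str, max_tokens: int = 50) -> str:
    # Score each sentence by word overlap with the query, then keep the
    # best-scoring sentences until the word budget is exhausted
    query_words = set(query.lower().split())
    scored = []
    for sent in split_sentences(text):
        overlap = len(query_words & set(sent.lower().split()))
        if overlap > 0:
            scored.append((sent, overlap))
    scored.sort(key=lambda x: x[1], reverse=True)
    kept, used = [], 0
    for sent, _ in scored:
        n = len(sent.split())
        if used + n <= max_tokens:
            kept.append(sent)
            used += n
    return " ".join(kept)
```

Note the trade-off this inherits from the class: sentences with zero lexical overlap are dropped entirely, which is why `compress_context` falls back to the start of the original text when nothing matches.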
app.py ADDED
@@ -0,0 +1,47 @@
+ """
+ Hugging Face Spaces compatible app
+ """
+ import os
+ import gradio as gr
+ from main import app as fastapi_app
+
+ # Gradio wrapper for Hugging Face Spaces
+ def create_gradio_interface():
+     """
+     Create the Gradio interface for deployment on Hugging Face Spaces
+     """
+     with gr.Blocks(title="Event Social Media Embeddings API") as demo:
+         gr.Markdown("""
+         # 🔍 Event Social Media Embeddings API
+
+         API for multimodal embeddings and search (text + images) with **Jina CLIP v2** + **Qdrant Cloud**
+
+         ## 🌟 Features:
+         - ✅ Multimodal: Text + Image embeddings
+         - ✅ Vietnamese: fully supported
+         - ✅ High Performance: ONNX + HNSW
+         - ✅ Cloud: Qdrant Cloud
+
+         ## 📡 API Endpoints:
+         - `POST /index` - Index data
+         - `POST /search` - Hybrid search
+         - `POST /search/text` - Text search
+         - `POST /search/image` - Image search
+
+         ### 🔗 API Docs:
+         Visit `/docs` for the full API documentation
+         """)
+
+         gr.Markdown("### API is running at the `/docs` endpoint")
+
+     return demo
+
+ # Mount FastAPI app
+ demo = create_gradio_interface()
+
+ # Wrap FastAPI with Gradio
+ app = gr.mount_gradio_app(fastapi_app, demo, path="/")
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
batch_index_pdfs.py ADDED
@@ -0,0 +1,151 @@
+ """
+ Batch script to index PDF files into RAG knowledge base
+ Usage: python batch_index_pdfs.py <pdf_directory> [options]
+ """
+
+ import os
+ import sys
+ from pathlib import Path
+ from pymongo import MongoClient
+ from embedding_service import JinaClipEmbeddingService
+ from qdrant_service import QdrantVectorService
+ from pdf_parser import PDFIndexer
+
+
+ def index_pdf_directory(
+     pdf_dir: str,
+     category: str = "user_guide",
+     force: bool = False
+ ):
+     """
+     Index all PDF files in a directory
+
+     Args:
+         pdf_dir: Directory containing PDF files
+         category: Category for the PDFs (default: "user_guide")
+         force: Force reindex even if already indexed (default: False)
+     """
+     print("="*60)
+     print("PDF Batch Indexer")
+     print("="*60)
+
+     # Initialize services (same as main.py)
+     print("\n[1/5] Initializing services...")
+     embedding_service = JinaClipEmbeddingService(model_path="jinaai/jina-clip-v2")
+
+     collection_name = os.getenv("COLLECTION_NAME", "event_social_media")
+     qdrant_service = QdrantVectorService(
+         collection_name=collection_name,
+         vector_size=embedding_service.get_embedding_dimension()
+     )
+
+     # MongoDB
+     mongodb_uri = os.getenv("MONGODB_URI", "mongodb+srv://truongtn7122003:[email protected]/")
+     mongo_client = MongoClient(mongodb_uri)
+     db = mongo_client[os.getenv("MONGODB_DB_NAME", "chatbot_rag")]
+     documents_collection = db["documents"]
+
+     # Initialize PDF indexer
+     pdf_indexer = PDFIndexer(
+         embedding_service=embedding_service,
+         qdrant_service=qdrant_service,
+         documents_collection=documents_collection
+     )
+     print("✓ Services initialized")
+
+     # Find all PDF files
+     print(f"\n[2/5] Scanning directory: {pdf_dir}")
+     pdf_files = list(Path(pdf_dir).glob("*.pdf"))
+
+     if not pdf_files:
+         print("✗ No PDF files found in directory")
+         return
+
+     print(f"✓ Found {len(pdf_files)} PDF file(s)")
+
+     # Index each PDF
+     print(f"\n[3/5] Indexing PDFs...")
+     indexed_count = 0
+     skipped_count = 0
+     error_count = 0
+
+     for i, pdf_path in enumerate(pdf_files, 1):
+         print(f"\n--- [{i}/{len(pdf_files)}] Processing: {pdf_path.name} ---")
+
+         # Generate document ID
+         doc_id = f"pdf_{pdf_path.stem}"
+
+         # Check if already indexed
+         if not force:
+             existing = documents_collection.find_one({"document_id": doc_id})
+             if existing:
+                 print(f"⊘ Already indexed (use --force to reindex)")
+                 skipped_count += 1
+                 continue
+
+         try:
+             # Index PDF
+             metadata = {
+                 'title': pdf_path.stem.replace('_', ' ').title(),
+                 'category': category,
+                 'source_file': str(pdf_path)
+             }
+
+             result = pdf_indexer.index_pdf(
+                 pdf_path=str(pdf_path),
+                 document_id=doc_id,
+                 document_metadata=metadata
+             )
+
+             print(f"✓ Indexed: {result['chunks_indexed']} chunks")
+             indexed_count += 1
+
+         except Exception as e:
+             print(f"✗ Error: {str(e)}")
+             error_count += 1
+
+     # Summary
+     print("\n" + "="*60)
+     print("SUMMARY")
+     print("="*60)
+     print(f"Total PDFs found: {len(pdf_files)}")
+     print(f"✓ Successfully indexed: {indexed_count}")
+     print(f"⊘ Skipped (already indexed): {skipped_count}")
+     print(f"✗ Errors: {error_count}")
+
+     if indexed_count > 0:
+         print(f"\n✓ Knowledge base updated successfully!")
+         print(f"You can now chat with your chatbot about the content in these PDFs.")
+
+
+ def main():
+     """Main entry point"""
+     if len(sys.argv) < 2:
+         print("Usage: python batch_index_pdfs.py <pdf_directory> [--category=<category>] [--force]")
+         print("\nExample:")
+         print("  python batch_index_pdfs.py ./docs/guides")
+         print("  python batch_index_pdfs.py ./docs/guides --category=user_guide --force")
+         sys.exit(1)
+
+     pdf_dir = sys.argv[1]
+
+     if not os.path.isdir(pdf_dir):
+         print(f"Error: Directory not found: {pdf_dir}")
+         sys.exit(1)
+
+     # Parse options
+     category = "user_guide"
+     force = False
+
+     for arg in sys.argv[2:]:
+         if arg.startswith("--category="):
+             category = arg.split("=")[1]
+         elif arg == "--force":
+             force = True
+
+     # Index PDFs
+     index_pdf_directory(pdf_dir, category=category, force=force)
+
+
+ if __name__ == "__main__":
+     main()
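The lightweight flag handling in `main()` (defaults, `--category=<name>`, bare `--force`) can be sketched standalone; `parse_options` is a hypothetical helper name, not part of the script:

```python
def parse_options(args: list[str]) -> tuple[str, bool]:
    # Same defaults and flags as main(): --category=<name> and --force
    category, force = "user_guide", False
    for arg in args:
        if arg.startswith("--category="):
            category = arg.split("=")[1]
        elif arg == "--force":
            force = True
    return category, force
```

For anything beyond these two flags, the stdlib `argparse` module would be the more idiomatic choice.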
chatbot_guide_template.md ADDED
@@ -0,0 +1,369 @@
+ # ChatbotRAG User Guide
+
+ *Version 2.0 - October 2025*
+
+ ---
+
+ ## 1. Introduction
+
+ ### What is ChatbotRAG?
+
+ ChatbotRAG is an intelligent chatbot system that uses RAG (Retrieval-Augmented Generation) to answer questions based on your knowledge base.
+
+ ### Key Features
+
+ - **Multimodal Search**: Search by text and images
+ - **Advanced RAG**: Query expansion, reranking, context compression
+ - **PDF Support**: Upload PDFs and chat about their content
+ - **Multiple Inputs**: Index several texts and images at once (up to 10 of each)
+ - **Chat History**: Chat history is stored for tracking
+
+ ---
+
+ ## 2. Quick Start
+
+ ### Step 1: Start the server
+
+ ```bash
+ cd ChatbotRAG
+ python main.py
+ ```
+
+ The server runs at: `http://localhost:8000`
+
+ ### Step 2: Open the API documentation
+
+ Open a browser and visit:
+ - API Docs: `http://localhost:8000/docs`
+ - ReDoc: `http://localhost:8000/redoc`
+
+ ### Step 3: Test with a simple question
+
+ ```bash
+ curl -X POST "http://localhost:8000/chat" \
+   -H "Content-Type: application/json" \
+   -d '{"message": "Xin chào, bạn là ai?"}'
+ ```
+
+ ---
+
+ ## 3. Indexing Data
+
+ ### 3.1. Index Plain Text
+
+ ```bash
+ curl -X POST "http://localhost:8000/index" \
+   -F "id=doc1" \
+   -F "texts=Đây là text nội dung 1" \
+   -F "texts=Đây là text nội dung 2"
+ ```
+
+ ### 3.2. Index With Images
+
+ ```bash
+ curl -X POST "http://localhost:8000/index" \
+   -F "id=event123" \
+   -F "texts=Sự kiện âm nhạc tại Hà Nội" \
+ ```
+
+ **Note**: At most 10 texts and 10 images per request.
+
+ ### 3.3. Upload a PDF
+
+ To upload a PDF document into the system:
+
+ ```bash
+ curl -X POST "http://localhost:8000/upload-pdf" \
+   -F "file=@user_guide.pdf" \
+   -F "title=Hướng dẫn sử dụng" \
+   -F "category=user_guide"
+ ```
+
+ After upload, the chatbot can answer questions about the PDF's content.
+
+ ---
+
89
+
90
+ ### 4.1. Search Bằng Text
91
+
92
+ ```bash
93
+ curl -X POST "http://localhost:8000/search/text" \
94
+ -F "text=sự kiện âm nhạc" \
95
+ -F "limit=5"
96
+ ```
97
+
98
+ ### 4.2. Search Bằng Image
99
+
100
+ ```bash
101
+ curl -X POST "http://localhost:8000/search/image" \
102
+ -F "image=@query_image.jpg" \
103
+ -F "limit=5"
104
+ ```
105
+
106
+ ### 4.3. Hybrid Search (Text + Image)
107
+
108
+ ```bash
109
+ curl -X POST "http://localhost:8000/search" \
110
+ -F "text=festival music" \
111
112
+ -F "text_weight=0.6" \
113
+ -F "image_weight=0.4"
114
+ ```
115
+
116
+ ---
117
+
118
+ ## 5. Chat Với Chatbot
119
+
120
+ ### 5.1. Chat Cơ Bản (Không RAG)
121
+
122
+ ```python
123
+ import requests
124
+
125
+ response = requests.post('http://localhost:8000/chat', json={
126
+ 'message': 'Xin chào!',
127
+ 'use_rag': False,
128
+ 'hf_token': 'your_huggingface_token'
129
+ })
130
+
131
+ print(response.json()['response'])
132
+ ```
133
+
134
+ ### 5.2. Chat Với RAG (Recommended)
135
+
136
+ ```python
137
+ response = requests.post('http://localhost:8000/chat', json={
138
+ 'message': 'Festival âm nhạc diễn ra khi nào?',
139
+ 'use_rag': True,
140
+ 'use_advanced_rag': True,
141
+ 'top_k': 5,
142
+ 'hf_token': 'your_token'
143
+ })
144
+
145
+ result = response.json()
146
+ print("Answer:", result['response'])
147
+ print("Sources:", result['context_used'])
148
+ ```
149
+
150
+ ### 5.3. Advanced RAG Options
151
+
152
+ ```python
153
+ response = requests.post('http://localhost:8000/chat', json={
154
+ 'message': 'Câu hỏi của bạn',
155
+ 'use_rag': True,
156
+ 'use_advanced_rag': True,
157
+
158
+ # Advanced RAG settings
159
+ 'use_query_expansion': True, # Mở rộng câu hỏi
160
+ 'use_reranking': True, # Rerank kết quả
161
+ 'use_compression': True, # Nén context
162
+ 'score_threshold': 0.5, # Ngưỡng relevance (0-1)
163
+ 'top_k': 5, # Số documents retrieve
164
+
165
+ # LLM settings
166
+ 'max_tokens': 512,
167
+ 'temperature': 0.7,
168
+ 'hf_token': 'your_token'
169
+ })
170
+ ```
171
+
172
+ ---
173
+
174
+ ## 6. Quản Lý Documents
175
+
176
+ ### 6.1. Xem Danh Sách Documents
177
+
178
+ ```bash
179
+ # Xem stats collection
180
+ curl http://localhost:8000/stats
181
+
182
+ # Xem PDFs
183
+ curl http://localhost:8000/documents/pdf
184
+ ```
185
+
186
+ ### 6.2. Get Document By ID
187
+
188
+ ```bash
189
+ curl http://localhost:8000/document/doc123
190
+ ```
191
+
192
+ ### 6.3. Xóa Document
193
+
194
+ ```bash
195
+ curl -X DELETE http://localhost:8000/delete/doc123
196
+ ```
197
+
198
+ ### 6.4. Xóa PDF Document
199
+
200
+ ```bash
201
+ curl -X DELETE http://localhost:8000/documents/pdf/pdf_20251029_143022
202
+ ```
203
+
204
+ ---
205
+
206
+ ## 7. Frequently Asked Questions (FAQ)
+
+ ### Q1: How do I upload a PDF into the system?
+
+ **A:** Use the `/upload-pdf` endpoint:
+
+ ```bash
+ curl -X POST "http://localhost:8000/upload-pdf" \
+   -F "file=@your_file.pdf" \
+   -F "title=Tên tài liệu"
+ ```
+
+ ### Q2: The chatbot can't find relevant information?
+
+ **A:** Try the following:
+ 1. Lower `score_threshold` (0.3 - 0.5)
+ 2. Raise `top_k` (5-10)
+ 3. Use `use_advanced_rag=True`
+ 4. Rephrase the question more clearly
+
+ ### Q3: How do I improve the chatbot's accuracy?
+
+ **A:**
+ - Enable Advanced RAG: `use_advanced_rag=True`
+ - Enable all RAG features: `use_reranking=True`, `use_compression=True`
+ - Index more documents with detailed content
+ - Use appropriate metadata when indexing
+
+ ### Q4: What is the LLM token limit?
+
+ **A:** The default is `max_tokens=512`. You can raise it in the request:
+
+ ```python
+ {
+     'message': 'Your question',
+     'max_tokens': 1024,  # Increased
+     'hf_token': 'your_token'
+ }
+ ```
+
+ ### Q5: How many texts/images can be uploaded at once?
+
+ **A:** At most **10 texts** and **10 images** per request to the `/index` endpoint.
+
+ ### Q6: Does the chatbot support Vietnamese?
+
+ **A:** Yes! The system uses Jina CLIP v2, which is multilingual and includes Vietnamese.
+
+ ### Q7: How do I view the chat history?
+
+ **A:**
+ ```bash
+ curl "http://localhost:8000/history?limit=10&skip=0"
+ ```
+
+ ### Q8: My PDF contains many images — is that a problem?
+
+ **A:** The system currently extracts only text from PDFs. Images inside PDFs are not processed yet. If image handling is needed, RAG-Anything could be integrated later.
+
+ ---
+
+ ## 8. API Reference
+
+ ### Main Endpoints
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/` | GET | Health check & API docs |
+ | `/index` | POST | Index texts + images (up to 10 of each) |
+ | `/search` | POST | Hybrid search (text + image) |
+ | `/search/text` | POST | Text-only search |
+ | `/search/image` | POST | Image-only search |
+ | `/chat` | POST | Chat with RAG |
+ | `/documents` | POST | Add text document |
+ | `/upload-pdf` | POST | Upload and index a PDF |
+ | `/documents/pdf` | GET | List PDFs |
+ | `/documents/pdf/{id}` | DELETE | Delete a PDF |
+ | `/history` | GET | Get chat history |
+ | `/stats` | GET | Collection statistics |
+
+ ### Request Examples
+
+ **Index with multiple texts:**
+ ```json
+ POST /index
+ {
+   "id": "doc123",
+   "texts": ["Text 1", "Text 2", "Text 3"]
+ }
+ ```
+
+ **Chat with Advanced RAG:**
+ ```json
+ POST /chat
+ {
+   "message": "Your question",
+   "use_rag": true,
+   "use_advanced_rag": true,
+   "use_reranking": true,
+   "top_k": 5,
+   "score_threshold": 0.5,
+   "hf_token": "hf_xxxxx"
+ }
+ ```
+
+ ---
+
+ ## 9. Best Practices
+
+ ### Indexing Data
+ ✓ Split content into meaningful chunks
+ ✓ Add complete metadata (title, category, source)
+ ✓ Use the texts array for multiple paragraphs
+ ✗ Avoid indexing overly long text in a single chunk
+
+ ### Chat
+ ✓ Enable Advanced RAG for complex questions
+ ✓ Tune `top_k` and `score_threshold` appropriately
+ ✓ Use a low `temperature` (0.3-0.5) for factual answers
+ ✗ Avoid setting `score_threshold` too high (>0.8)
+
+ ### PDF
+ ✓ PDFs with a text layer (not scanned images)
+ ✓ Clear structure with headings and paragraphs
+ ✓ Concise, easy-to-understand content
+ ✗ Avoid PDFs with many complex images
+
+ ---
+
+ ## 10. Troubleshooting
+
+ ### Server won't start
+ - Check dependencies: `pip install -r requirements.txt`
+ - Check the MongoDB connection string
+ - Check the Qdrant service
+
+ ### PDF upload fails
+ - Verify the file is a valid PDF
+ - Check that the file is not corrupt
+ - Re-convert the PDF if needed
+
+ ### The chatbot answers incorrectly
+ - Check that documents have been indexed: `/stats`
+ - Try lowering `score_threshold`
+ - Enable the Advanced RAG options
+ - Check the LLM token (Hugging Face)
+
+ ### Out of memory
+ - Reduce `chunk_size` in the PDF parser
+ - Reduce `top_k` in the chat request
+ - Index fewer documents per run
+
+ ---
+
+ ## 11. Contact & Support
+
+ If you have questions or issues:
+ - Check the server logs
+ - Review the API documentation at `/docs`
+ - See the GitHub issues
+
+ ---
+
+ **Happy Chatting! 🤖**
chatbot_rag.py ADDED
@@ -0,0 +1,351 @@
+ import gradio as gr
+ from huggingface_hub import InferenceClient
+ from pymongo import MongoClient
+ from datetime import datetime
+ from typing import List, Dict
+ import numpy as np
+
+ from embedding_service import JinaClipEmbeddingService
+ from qdrant_service import QdrantVectorService
+
+
+ class ChatbotRAG:
+     """
+     RAG chatbot with:
+     - LLM: GPT-OSS-20B (Hugging Face)
+     - Embeddings: Jina CLIP v2
+     - Vector DB: Qdrant
+     - Document Store: MongoDB
+     """
+
+     def __init__(
+         self,
+         mongodb_uri: str = "mongodb+srv://truongtn7122003:[email protected]/",
+         db_name: str = "chatbot_rag",
+         collection_name: str = "documents"
+     ):
+         """
+         Initialize ChatbotRAG
+
+         Args:
+             mongodb_uri: MongoDB connection string
+             db_name: Database name
+             collection_name: Collection name for documents
+         """
+         print("Initializing ChatbotRAG...")
+
+         # MongoDB client
+         self.mongo_client = MongoClient(mongodb_uri)
+         self.db = self.mongo_client[db_name]
+         self.documents_collection = self.db[collection_name]
+         self.chat_history_collection = self.db["chat_history"]
+
+         # Embedding service (Jina CLIP v2)
+         self.embedding_service = JinaClipEmbeddingService(
+             model_path="jinaai/jina-clip-v2"
+         )
+
+         # Qdrant vector service
+         self.qdrant_service = QdrantVectorService(
+             collection_name="chatbot_rag_vectors",
+             vector_size=self.embedding_service.get_embedding_dimension()
+         )
+
+         print("✓ ChatbotRAG initialized successfully")
+
+     def add_document(self, text: str, metadata: Dict = None) -> str:
+         """
+         Add document to MongoDB and Qdrant
+
+         Args:
+             text: Document text
+             metadata: Additional metadata
+
+         Returns:
+             Document ID
+         """
+         # Save to MongoDB
+         doc_data = {
+             "text": text,
+             "metadata": metadata or {},
+             "created_at": datetime.utcnow()
+         }
+         result = self.documents_collection.insert_one(doc_data)
+         doc_id = str(result.inserted_id)
+
+         # Generate embedding
+         embedding = self.embedding_service.encode_text(text)
+
+         # Index to Qdrant
+         self.qdrant_service.index_data(
+             doc_id=doc_id,
+             embedding=embedding,
+             metadata={
+                 "text": text,
+                 "source": "user_upload",
+                 **(metadata or {})
+             }
+         )
+
+         return doc_id
+
+     def retrieve_context(self, query: str, top_k: int = 3) -> List[Dict]:
+         """
+         Retrieve relevant context from vector DB
+
+         Args:
+             query: User query
+             top_k: Number of results to retrieve
+
+         Returns:
+             List of relevant documents
+         """
+         # Generate query embedding
+         query_embedding = self.embedding_service.encode_text(query)
+
+         # Search in Qdrant
+         results = self.qdrant_service.search(
+             query_embedding=query_embedding,
+             limit=top_k,
+             score_threshold=0.5  # Only get relevant results
+         )
+
+         return results
+
+     def save_chat_history(self, user_message: str, assistant_response: str, context_used: List[Dict]):
+         """
+         Save chat interaction to MongoDB
+
+         Args:
+             user_message: User's message
+             assistant_response: Assistant's response
+             context_used: Context retrieved from RAG
+         """
+         chat_data = {
+             "user_message": user_message,
+             "assistant_response": assistant_response,
+             "context_used": context_used,
+             "timestamp": datetime.utcnow()
+         }
+         self.chat_history_collection.insert_one(chat_data)
+
+     def respond(
+         self,
+         message: str,
+         history: List[Dict[str, str]],
+         system_message: str,
+         max_tokens: int,
+         temperature: float,
+         top_p: float,
+         use_rag: bool,
+         hf_token: gr.OAuthToken,
+     ):
+         """
+         Generate response with RAG
+
+         Args:
+             message: User message
+             history: Chat history
+             system_message: System prompt
+             max_tokens: Max tokens to generate
+             temperature: Temperature for generation
+             top_p: Top-p sampling
+             use_rag: Whether to use RAG retrieval
+             hf_token: Hugging Face token
+
+         Yields:
+             Generated response
+         """
+         # Initialize LLM client
+         client = InferenceClient(token=hf_token.token, model="openai/gpt-oss-20b")
+
+         # Prepare context from RAG
+         context_text = ""
+         context_used = []
+
+         if use_rag:
+             # Retrieve relevant context
+             retrieved_docs = self.retrieve_context(message, top_k=3)
+             context_used = retrieved_docs
+
+             if retrieved_docs:
+                 context_text = "\n\n**Relevant Context:**\n"
+                 for i, doc in enumerate(retrieved_docs, 1):
+                     doc_text = doc["metadata"].get("text", "")
+                     confidence = doc["confidence"]
+                     context_text += f"\n[{i}] (Confidence: {confidence:.2f})\n{doc_text}\n"
+
+             # Add context to system message
+             system_message = f"{system_message}\n\n{context_text}\n\nPlease use the above context to answer the user's question when relevant."
+
+         # Build messages for LLM
+         messages = [{"role": "system", "content": system_message}]
+         messages.extend(history)
+         messages.append({"role": "user", "content": message})
+
+         # Generate response
+         response = ""
+
+         try:
+             for msg in client.chat_completion(
+                 messages,
+                 max_tokens=max_tokens,
+                 stream=True,
+                 temperature=temperature,
+                 top_p=top_p,
+             ):
+                 choices = msg.choices
+                 token = ""
+                 if len(choices) and choices[0].delta.content:
+                     token = choices[0].delta.content
+
+                 response += token
+                 yield response
+
+             # Save to chat history
+             self.save_chat_history(message, response, context_used)
+
+         except Exception as e:
+             error_msg = f"Error generating response: {str(e)}"
+             yield error_msg
+
213
+ # Initialize ChatbotRAG
214
+ chatbot_rag = ChatbotRAG()
215
+
216
+
217
+ def respond_wrapper(
218
+ message,
219
+ history,
220
+ system_message,
221
+ max_tokens,
222
+ temperature,
223
+ top_p,
224
+ use_rag,
225
+ hf_token,
226
+ ):
227
+ """Wrapper for Gradio ChatInterface"""
228
+ yield from chatbot_rag.respond(
229
+ message=message,
230
+ history=history,
231
+ system_message=system_message,
232
+ max_tokens=max_tokens,
233
+ temperature=temperature,
234
+ top_p=top_p,
235
+ use_rag=use_rag,
236
+ hf_token=hf_token,
237
+ )
238
+
239
+
240
+ def add_document_to_rag(text: str) -> str:
241
+ """
242
+ Add document to RAG knowledge base
243
+
244
+ Args:
245
+ text: Document text
246
+
247
+ Returns:
248
+ Success message
249
+ """
250
+ try:
251
+ doc_id = chatbot_rag.add_document(text)
252
+ return f"✓ Document added successfully! ID: {doc_id}"
253
+ except Exception as e:
254
+ return f"✗ Error adding document: {str(e)}"
255
+
256
+
257
+ # Create Gradio interface
258
+ with gr.Blocks(title="ChatbotRAG - GPT-OSS-20B + Jina CLIP v2 + MongoDB") as demo:
259
+ gr.Markdown("""
260
+ # 🤖 ChatbotRAG
261
+
262
+ **Features:**
263
+ - 💬 LLM: GPT-OSS-20B
264
+ - 🔍 Embeddings: Jina CLIP v2 (Vietnamese support)
265
+ - 📊 Vector DB: Qdrant Cloud
266
+ - 🗄️ Document Store: MongoDB
267
+
268
+ **How to use:**
269
+ 1. Add documents to knowledge base (optional)
270
+ 2. Toggle "Use RAG" to enable context retrieval
271
+ 3. Chat with the bot!
272
+ """)
273
+
274
+ with gr.Sidebar():
275
+ gr.LoginButton()
276
+
277
+ gr.Markdown("### ⚙️ Settings")
278
+
279
+ use_rag = gr.Checkbox(
280
+ label="Use RAG",
281
+ value=True,
282
+ info="Enable RAG to retrieve relevant context from knowledge base"
283
+ )
284
+
285
+ system_message = gr.Textbox(
286
+ value="You are a helpful AI assistant. Answer questions based on the provided context when available.",
287
+ label="System message",
288
+ lines=3
289
+ )
290
+
291
+ max_tokens = gr.Slider(
292
+ minimum=1,
293
+ maximum=2048,
294
+ value=512,
295
+ step=1,
296
+ label="Max new tokens"
297
+ )
298
+
299
+ temperature = gr.Slider(
300
+ minimum=0.1,
301
+ maximum=4.0,
302
+ value=0.7,
303
+ step=0.1,
304
+ label="Temperature"
305
+ )
306
+
307
+ top_p = gr.Slider(
308
+ minimum=0.1,
309
+ maximum=1.0,
310
+ value=0.95,
311
+ step=0.05,
312
+ label="Top-p (nucleus sampling)"
313
+ )
314
+
315
+ # Chat interface
316
+ chatbot = gr.ChatInterface(
317
+ respond_wrapper,
318
+ type="messages",
319
+ additional_inputs=[
320
+ system_message,
321
+ max_tokens,
322
+ temperature,
323
+ top_p,
324
+ use_rag,
325
+ ],
326
+ )
327
+
328
+ # Document management
329
+ with gr.Accordion("📚 Knowledge Base Management", open=False):
330
+ gr.Markdown("### Add Documents to Knowledge Base")
331
+
332
+ doc_text = gr.Textbox(
333
+ label="Document Text",
334
+ placeholder="Enter document text here...",
335
+ lines=5
336
+ )
337
+
338
+ add_btn = gr.Button("Add Document", variant="primary")
339
+ output_msg = gr.Textbox(label="Status", interactive=False)
340
+
341
+ add_btn.click(
342
+ fn=add_document_to_rag,
343
+ inputs=[doc_text],
344
+ outputs=[output_msg]
345
+ )
346
+
347
+ chatbot.render()
348
+
349
+
350
+ if __name__ == "__main__":
351
+ demo.launch(server_name="0.0.0.0", server_port=7860)
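The payload assembly inside `respond()` follows the standard chat-completion message shape: system prompt first (with any retrieved RAG context already folded in), then the prior turns, then the new user message. A minimal standalone sketch, with `build_messages` as a hypothetical helper name:

```python
def build_messages(system_message: str, history: list[dict], user_message: str) -> list[dict]:
    # System prompt first (RAG context is appended to it upstream),
    # then the prior conversation turns, then the new user message
    messages = [{"role": "system", "content": system_message}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return messages
```

Because `history` already uses the `{"role": ..., "content": ...}` format (Gradio's `type="messages"`), it can be spliced in unchanged.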
chatbot_rag_api.py ADDED
@@ -0,0 +1,468 @@
1
+ from fastapi import FastAPI, HTTPException, File, UploadFile, Form
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+ from typing import Optional, List, Dict
+ from pymongo import MongoClient
+ from datetime import datetime
+ import numpy as np
+ import os
+ from huggingface_hub import InferenceClient
+
+ from embedding_service import JinaClipEmbeddingService
+ from qdrant_service import QdrantVectorService
+
+
+ # Pydantic models
+ class ChatRequest(BaseModel):
+     message: str
+     use_rag: bool = True
+     top_k: int = 3
+     system_message: Optional[str] = "You are a helpful AI assistant."
+     max_tokens: int = 512
+     temperature: float = 0.7
+     top_p: float = 0.95
+     hf_token: Optional[str] = None  # Hugging Face token (optional; falls back to the env variable if omitted)
+
+
+ class ChatResponse(BaseModel):
+     response: str
+     context_used: List[Dict]
+     timestamp: str
+
+
+ class AddDocumentRequest(BaseModel):
+     text: str
+     metadata: Optional[Dict] = None
+
+
+ class AddDocumentResponse(BaseModel):
+     success: bool
+     doc_id: str
+     message: str
+
+
+ class SearchRequest(BaseModel):
+     query: str
+     top_k: int = 5
+     score_threshold: Optional[float] = 0.5
+
+
+ class SearchResponse(BaseModel):
+     results: List[Dict]
+
+
+ # Initialize FastAPI
+ app = FastAPI(
+     title="ChatbotRAG API",
+     description="API for RAG Chatbot with GPT-OSS-20B + Jina CLIP v2 + MongoDB + Qdrant",
+     version="1.0.0"
+ )
+
+ # CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],  # Allow all origins (consider restricting this in production)
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+
+ # ChatbotRAG Service
+ class ChatbotRAGService:
+     """
+     ChatbotRAG service backing the API
+     """
+
+     def __init__(
+         self,
+         mongodb_uri: str = "mongodb+srv://truongtn7122003:[email protected]/",
+         db_name: str = "chatbot_rag",
+         collection_name: str = "documents",
+         hf_token: Optional[str] = None
+     ):
+         print("Initializing ChatbotRAG Service...")
+
+         # MongoDB
+         self.mongo_client = MongoClient(mongodb_uri)
+         self.db = self.mongo_client[db_name]
+         self.documents_collection = self.db[collection_name]
+         self.chat_history_collection = self.db["chat_history"]
+
+         # Embedding service
+         self.embedding_service = JinaClipEmbeddingService(
+             model_path="jinaai/jina-clip-v2"
+         )
+
+         # Qdrant (the collection name comes from the environment, not the constructor argument)
+         qdrant_collection = os.getenv("COLLECTION_NAME", "event_social_media")
+         self.qdrant_service = QdrantVectorService(
+             collection_name=qdrant_collection,
+             vector_size=self.embedding_service.get_embedding_dimension()
+         )
+
+         # Hugging Face token (from env or passed in)
+         self.hf_token = hf_token or os.getenv("HUGGINGFACE_TOKEN")
+         if self.hf_token:
+             print("✓ Hugging Face token configured")
+         else:
+             print("⚠ No Hugging Face token - LLM generation will use placeholder")
+
+         print("✓ ChatbotRAG Service initialized")
+
+     def add_document(self, text: str, metadata: Dict = None) -> str:
+         """Add document to knowledge base"""
+         # Save to MongoDB
+         doc_data = {
+             "text": text,
+             "metadata": metadata or {},
+             "created_at": datetime.utcnow()
+         }
+         result = self.documents_collection.insert_one(doc_data)
+         doc_id = str(result.inserted_id)
+
+         # Generate embedding
+         embedding = self.embedding_service.encode_text(text)
+
+         # Index to Qdrant
+         self.qdrant_service.index_data(
+             doc_id=doc_id,
+             embedding=embedding,
+             metadata={
+                 "text": text,
+                 "source": "api",
+                 **(metadata or {})
+             }
+         )
+
+         return doc_id
+
+     def retrieve_context(self, query: str, top_k: int = 3, score_threshold: float = 0.5) -> List[Dict]:
+         """Retrieve relevant context from vector DB"""
+         # Generate query embedding
+         query_embedding = self.embedding_service.encode_text(query)
+
+         # Search in Qdrant
+         results = self.qdrant_service.search(
+             query_embedding=query_embedding,
+             limit=top_k,
+             score_threshold=score_threshold
+         )
+
+         return results
+
+     def generate_response(
+         self,
+         message: str,
+         context: List[Dict],
+         system_message: str,
+         max_tokens: int = 512,
+         temperature: float = 0.7,
+         top_p: float = 0.95,
+         hf_token: Optional[str] = None
+     ) -> str:
+         """
+         Generate response using Hugging Face LLM
+         """
+         # Build context text
+         context_text = ""
+         if context:
+             context_text = "\n\nRelevant Context:\n"
+             for i, doc in enumerate(context, 1):
+                 doc_text = doc["metadata"].get("text", "")
+                 confidence = doc["confidence"]
+                 context_text += f"\n[{i}] (Confidence: {confidence:.2f})\n{doc_text}\n"
+
+         # Add context to system message
+         system_message = f"{system_message}\n{context_text}\n\nPlease use the above context to answer the user's question when relevant."
+
+         # Use token from request or fall back to the service token
+         token = hf_token or self.hf_token
+
+         # If no token is available, return a placeholder
+         if not token:
+             return f"""[LLM Response Placeholder]
+
+ Context retrieved: {len(context)} documents
+ User question: {message}
+
+ To enable actual LLM generation:
+ 1. Set HUGGINGFACE_TOKEN environment variable, OR
+ 2. Pass hf_token in request body
+
+ Example:
+ {{
+     "message": "Your question",
+     "hf_token": "hf_xxxxxxxxxxxxx"
+ }}
+ """
+
+         # Initialize HF Inference Client
+         try:
+             client = InferenceClient(
+                 token=token,
+                 model="openai/gpt-oss-20b"
+             )
+
+             # Build messages
+             messages = [
+                 {"role": "system", "content": system_message},
+                 {"role": "user", "content": message}
+             ]
+
+             # Stream the completion and accumulate it into a single string
+             response = ""
+             for msg in client.chat_completion(
+                 messages,
+                 max_tokens=max_tokens,
+                 stream=True,
+                 temperature=temperature,
+                 top_p=top_p,
+             ):
+                 choices = msg.choices
+                 if len(choices) and choices[0].delta.content:
+                     response += choices[0].delta.content
+
+             return response
+
+         except Exception as e:
+             return f"Error generating response with LLM: {str(e)}\n\nContext was retrieved successfully, but LLM generation failed."
+
+     def save_chat_history(self, user_message: str, assistant_response: str, context_used: List[Dict]):
+         """Save chat to MongoDB"""
+         chat_data = {
+             "user_message": user_message,
+             "assistant_response": assistant_response,
+             "context_used": context_used,
+             "timestamp": datetime.utcnow()
+         }
+         self.chat_history_collection.insert_one(chat_data)
+
+     def get_stats(self) -> Dict:
+         """Get statistics"""
+         return {
+             "documents_count": self.documents_collection.count_documents({}),
+             "chat_history_count": self.chat_history_collection.count_documents({}),
+             "qdrant_info": self.qdrant_service.get_collection_info()
+         }
+
+
+ # Initialize service
+ rag_service = ChatbotRAGService()
+
+
+ # API Endpoints
+
+ @app.get("/")
+ async def root():
+     """Health check"""
+     return {
+         "status": "running",
+         "service": "ChatbotRAG API",
+         "version": "1.0.0",
+         "endpoints": {
+             "POST /chat": "Chat with RAG",
+             "POST /documents": "Add document to knowledge base",
+             "POST /search": "Search in knowledge base",
+             "GET /stats": "Get statistics",
+             "GET /history": "Get chat history"
+         }
+     }
+
+
+ @app.post("/chat", response_model=ChatResponse)
+ async def chat(request: ChatRequest):
+     """
+     Chat endpoint with RAG
+
+     Body:
+     - message: User message
+     - use_rag: Enable RAG retrieval (default: true)
+     - top_k: Number of documents to retrieve (default: 3)
+     - system_message: System prompt (optional)
+     - max_tokens: Max tokens for response (default: 512)
+     - temperature: Temperature for generation (default: 0.7)
+
+     Returns:
+     - response: Generated response
+     - context_used: Retrieved context documents
+     - timestamp: Response timestamp
+     """
+     try:
+         # Retrieve context if RAG enabled
+         context_used = []
+         if request.use_rag:
+             context_used = rag_service.retrieve_context(
+                 query=request.message,
+                 top_k=request.top_k
+             )
+
+         # Generate response
+         response = rag_service.generate_response(
+             message=request.message,
+             context=context_used,
+             system_message=request.system_message,
+             max_tokens=request.max_tokens,
+             temperature=request.temperature,
+             top_p=request.top_p,
+             hf_token=request.hf_token
+         )
+
+         # Save to history
+         rag_service.save_chat_history(
+             user_message=request.message,
+             assistant_response=response,
+             context_used=context_used
+         )
+
+         return ChatResponse(
+             response=response,
+             context_used=context_used,
+             timestamp=datetime.utcnow().isoformat()
+         )
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.post("/documents", response_model=AddDocumentResponse)
+ async def add_document(request: AddDocumentRequest):
+     """
+     Add document to knowledge base
+
+     Body:
+     - text: Document text
+     - metadata: Additional metadata (optional)
+
+     Returns:
+     - success: True/False
+     - doc_id: MongoDB document ID
+     - message: Status message
+     """
+     try:
+         doc_id = rag_service.add_document(
+             text=request.text,
+             metadata=request.metadata
+         )
+
+         return AddDocumentResponse(
+             success=True,
+             doc_id=doc_id,
+             message=f"Document added successfully with ID: {doc_id}"
+         )
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.post("/search", response_model=SearchResponse)
+ async def search(request: SearchRequest):
+     """
+     Search in knowledge base
+
+     Body:
+     - query: Search query
+     - top_k: Number of results (default: 5)
+     - score_threshold: Minimum score (default: 0.5)
+
+     Returns:
+     - results: List of matching documents
+     """
+     try:
+         results = rag_service.retrieve_context(
+             query=request.query,
+             top_k=request.top_k,
+             score_threshold=request.score_threshold
+         )
+
+         return SearchResponse(results=results)
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.get("/stats")
+ async def get_stats():
+     """
+     Get statistics
+
+     Returns:
+     - documents_count: Number of documents in MongoDB
+     - chat_history_count: Number of chat messages
+     - qdrant_info: Qdrant collection info
+     """
+     try:
+         return rag_service.get_stats()
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.get("/history")
+ async def get_history(limit: int = 10, skip: int = 0):
+     """
+     Get chat history
+
+     Query params:
+     - limit: Number of messages to return (default: 10)
+     - skip: Number of messages to skip (default: 0)
+
+     Returns:
+     - history: List of chat messages
+     """
+     try:
+         history = list(
+             rag_service.chat_history_collection
+             .find({}, {"_id": 0})
+             .sort("timestamp", -1)
+             .skip(skip)
+             .limit(limit)
+         )
+
+         # Convert datetime to string
+         for msg in history:
+             if "timestamp" in msg:
+                 msg["timestamp"] = msg["timestamp"].isoformat()
+
+         return {"history": history, "total": rag_service.chat_history_collection.count_documents({})}
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.delete("/documents/{doc_id}")
+ async def delete_document(doc_id: str):
+     """
+     Delete document from knowledge base
+
+     Args:
+     - doc_id: Document ID (MongoDB ObjectId)
+
+     Returns:
+     - success: True/False
+     - message: Status message
+     """
+     try:
+         # Delete from MongoDB. The path parameter is a string, so convert it
+         # to an ObjectId first; matching a raw string against "_id" never succeeds.
+         from bson import ObjectId
+         result = rag_service.documents_collection.delete_one({"_id": ObjectId(doc_id)})
+
+         # Delete from Qdrant
+         if result.deleted_count > 0:
+             rag_service.qdrant_service.delete_by_id(doc_id)
+             return {"success": True, "message": f"Document {doc_id} deleted"}
+         else:
+             raise HTTPException(status_code=404, detail=f"Document {doc_id} not found")
+
+     except HTTPException:
+         raise
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(
+         app,
+         host="0.0.0.0",
+         port=8000,
+         log_level="info"
+     )
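The prompt-assembly step in `generate_response` can be sanity-checked in isolation. This sketch reproduces the "Relevant Context" formatting with hypothetical retrieval results; the function and variable names are illustrative and not part of the API:

```python
def build_context_block(context):
    """Reproduce the context block generate_response prepends to the system prompt."""
    if not context:
        return ""
    block = "\n\nRelevant Context:\n"
    for i, doc in enumerate(context, 1):
        doc_text = doc["metadata"].get("text", "")
        confidence = doc["confidence"]
        block += f"\n[{i}] (Confidence: {confidence:.2f})\n{doc_text}\n"
    return block


# Hypothetical retrieval results in the same shape the Qdrant search returns.
docs = [
    {"metadata": {"text": "Music event in Hanoi on 20/10/2025"}, "confidence": 0.91},
    {"metadata": {"text": "Venue: National Convention Center"}, "confidence": 0.74},
]
print(build_context_block(docs))
```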
embedding_service.py ADDED
@@ -0,0 +1,173 @@
+ import torch
+ import numpy as np
+ from PIL import Image
+ from transformers import AutoModel
+ from typing import Union, List
+ import io
+
+
+ class JinaClipEmbeddingService:
+     """
+     Jina CLIP v2 embedding service with Vietnamese language support.
+     Uses AutoModel with trust_remote_code.
+     """
+
+     def __init__(self, model_path: str = "jinaai/jina-clip-v2"):
+         """
+         Initialize Jina CLIP v2 model
+
+         Args:
+             model_path: Local model path or HuggingFace model name
+         """
+         print(f"Loading Jina CLIP v2 model from {model_path}...")
+
+         # Load the model with trust_remote_code
+         self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
+
+         # Switch to eval mode
+         self.model.eval()
+
+         # Use GPU if available
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         self.model.to(self.device)
+
+         print(f"✓ Loaded Jina CLIP v2 model on: {self.device}")
+
+     def encode_text(
+         self,
+         text: Union[str, List[str]],
+         truncate_dim: int = None,
+         normalize: bool = True
+     ) -> np.ndarray:
+         """
+         Encode text into vector embeddings (supports Vietnamese)
+
+         Args:
+             text: A text or list of texts (Vietnamese supported)
+             truncate_dim: Matryoshka dimension (64-1024, None = full 1024)
+             normalize: Whether to L2-normalize the embeddings
+
+         Returns:
+             numpy array of embeddings
+         """
+         if isinstance(text, str):
+             text = [text]
+
+         # Jina CLIP v2 encode_text method
+         # Tokenization is handled internally
+         embeddings = self.model.encode_text(
+             text,
+             truncate_dim=truncate_dim  # Optional: 64, 128, 256, 512, 1024
+         )
+
+         # Convert to numpy
+         if isinstance(embeddings, torch.Tensor):
+             embeddings = embeddings.cpu().detach().numpy()
+
+         # Normalize if requested
+         if normalize:
+             embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
+
+         return embeddings
+
+     def encode_image(
+         self,
+         image: Union[Image.Image, bytes, List, str],
+         truncate_dim: int = None,
+         normalize: bool = True
+     ) -> np.ndarray:
+         """
+         Encode image into vector embeddings
+
+         Args:
+             image: PIL Image, bytes, URL string, or list of images
+             truncate_dim: Matryoshka dimension (64-1024, None = full 1024)
+             normalize: Whether to L2-normalize the embeddings
+
+         Returns:
+             numpy array of embeddings
+         """
+         # Convert bytes to a PIL Image if needed
+         if isinstance(image, bytes):
+             image = Image.open(io.BytesIO(image)).convert('RGB')
+         elif isinstance(image, list):
+             processed_images = []
+             for img in image:
+                 if isinstance(img, bytes):
+                     processed_images.append(Image.open(io.BytesIO(img)).convert('RGB'))
+                 elif isinstance(img, str):
+                     # URL string - keep as is, Jina CLIP can handle URLs
+                     processed_images.append(img)
+                 else:
+                     processed_images.append(img)
+             image = processed_images
+         elif not isinstance(image, str):
+             # Single PIL Image - wrap in a list
+             image = [image]
+
+         # Jina CLIP v2 encode_image method
+         # Supports PIL Images, file paths, or URLs
+         embeddings = self.model.encode_image(
+             image,
+             truncate_dim=truncate_dim  # Optional: 64, 128, 256, 512, 1024
+         )
+
+         # Convert to numpy
+         if isinstance(embeddings, torch.Tensor):
+             embeddings = embeddings.cpu().detach().numpy()
+
+         # Normalize if requested
+         if normalize:
+             embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
+
+         return embeddings
+
+     def encode_multimodal(
+         self,
+         text: Union[str, List[str]] = None,
+         image: Union[Image.Image, bytes, List] = None,
+         truncate_dim: int = None,
+         normalize: bool = True
+     ) -> np.ndarray:
+         """
+         Encode both text and image and return a combined embedding
+
+         Args:
+             text: A text or list of texts (Vietnamese supported)
+             image: PIL Image, bytes, or list of images
+             truncate_dim: Matryoshka dimension (64-1024, None = full 1024)
+             normalize: Whether to L2-normalize the embeddings
+
+         Returns:
+             numpy array of embeddings
+         """
+         embeddings = []
+
+         if text is not None:
+             text_emb = self.encode_text(text, truncate_dim=truncate_dim, normalize=False)
+             embeddings.append(text_emb)
+
+         if image is not None:
+             image_emb = self.encode_image(image, truncate_dim=truncate_dim, normalize=False)
+             embeddings.append(image_emb)
+
+         # Combine embeddings (average)
+         if len(embeddings) == 2:
+             # Average of the text and image embeddings
+             combined = np.mean(embeddings, axis=0)
+         elif len(embeddings) == 1:
+             combined = embeddings[0]
+         else:
+             raise ValueError("At least one of text or image must be provided")
+
+         # Normalize if requested
+         if normalize:
+             combined = combined / np.linalg.norm(combined, axis=1, keepdims=True)
+
+         return combined
+
+     def get_embedding_dimension(self) -> int:
+         """
+         Return the embedding dimension (1024 for Jina CLIP v2)
+         """
+         return 1024
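The fusion step in `encode_multimodal` is just an element-wise average followed by row-wise L2 normalization. This small numpy sketch shows the same two operations on toy 3-dimensional vectors (real Jina CLIP v2 embeddings are 1024-dimensional):

```python
import numpy as np

# Toy stand-ins for one text embedding and one image embedding (shape 1x3).
text_emb = np.array([[3.0, 0.0, 0.0]])
image_emb = np.array([[0.0, 4.0, 0.0]])

# Average the two modalities, then L2-normalize each row - the same
# combination encode_multimodal performs on its 1024-d vectors.
combined = np.mean([text_emb, image_emb], axis=0)   # [[1.5, 2.0, 0.0]]
combined = combined / np.linalg.norm(combined, axis=1, keepdims=True)

print(combined)  # → [[0.6 0.8 0. ]]
```

Averaging unnormalized embeddings keeps the fused vector in the shared CLIP space, and the final normalization makes cosine similarity equivalent to a dot product at search time.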
main.py ADDED
@@ -0,0 +1,1285 @@
+ from fastapi import FastAPI, UploadFile, File, Form, HTTPException
+ from fastapi.responses import JSONResponse
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+ from typing import Optional, List, Dict
+ from PIL import Image
+ import io
+ import numpy as np
+ import os
+ from datetime import datetime
+ from pymongo import MongoClient
+ from huggingface_hub import InferenceClient
+
+ from embedding_service import JinaClipEmbeddingService
+ from qdrant_service import QdrantVectorService
+ from advanced_rag import AdvancedRAG
+ from pdf_parser import PDFIndexer
+ from multimodal_pdf_parser import MultimodalPDFIndexer
+
+ # Initialize FastAPI app
+ app = FastAPI(
+     title="Event Social Media Embeddings & ChatbotRAG API",
+     description="API for embeddings, search, and ChatbotRAG with Jina CLIP v2 + Qdrant + MongoDB + LLM",
+     version="2.0.0"
+ )
+
+ # CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Initialize services
+ print("Initializing services...")
+ embedding_service = JinaClipEmbeddingService(model_path="jinaai/jina-clip-v2")
+
+ collection_name = os.getenv("COLLECTION_NAME", "event_social_media")
+ qdrant_service = QdrantVectorService(
+     collection_name=collection_name,
+     vector_size=embedding_service.get_embedding_dimension()
+ )
+ print(f"✓ Qdrant collection: {collection_name}")
+
+ # MongoDB connection
+ mongodb_uri = os.getenv("MONGODB_URI", "mongodb+srv://truongtn7122003:[email protected]/")
+ mongo_client = MongoClient(mongodb_uri)
+ db = mongo_client[os.getenv("MONGODB_DB_NAME", "chatbot_rag")]
+ documents_collection = db["documents"]
+ chat_history_collection = db["chat_history"]
+ print("✓ MongoDB connected")
+
+ # Hugging Face token
+ hf_token = os.getenv("HUGGINGFACE_TOKEN")
+ if hf_token:
+     print("✓ Hugging Face token configured")
+
+ # Initialize Advanced RAG
+ advanced_rag = AdvancedRAG(
+     embedding_service=embedding_service,
+     qdrant_service=qdrant_service
+ )
+ print("✓ Advanced RAG pipeline initialized")
+
+ # Initialize PDF Indexer
+ pdf_indexer = PDFIndexer(
+     embedding_service=embedding_service,
+     qdrant_service=qdrant_service,
+     documents_collection=documents_collection
+ )
+ print("✓ PDF Indexer initialized")
+
+ # Initialize Multimodal PDF Indexer (for PDFs with images)
+ multimodal_pdf_indexer = MultimodalPDFIndexer(
+     embedding_service=embedding_service,
+     qdrant_service=qdrant_service,
+     documents_collection=documents_collection
+ )
+ print("✓ Multimodal PDF Indexer initialized")
+
+ print("✓ Services initialized successfully")
+
+
+ # Pydantic models for embeddings
+ class SearchRequest(BaseModel):
+     text: Optional[str] = None
+     limit: int = 10
+     score_threshold: Optional[float] = None
+     text_weight: float = 0.5
+     image_weight: float = 0.5
+
+
+ class SearchResponse(BaseModel):
+     id: str
+     confidence: float
+     metadata: dict
+
+
+ class IndexResponse(BaseModel):
+     success: bool
+     id: str
+     message: str
+
+
+ # Pydantic models for ChatbotRAG
+ class ChatRequest(BaseModel):
+     message: str
+     use_rag: bool = True
+     top_k: int = 3
+     system_message: Optional[str] = "You are a helpful AI assistant."
+     max_tokens: int = 512
+     temperature: float = 0.7
+     top_p: float = 0.95
+     hf_token: Optional[str] = None
+     # Advanced RAG options
+     use_advanced_rag: bool = True
+     use_query_expansion: bool = True
+     use_reranking: bool = True
+     use_compression: bool = True
+     score_threshold: float = 0.5
+
+
+ class ChatResponse(BaseModel):
+     response: str
+     context_used: List[Dict]
+     timestamp: str
+     rag_stats: Optional[Dict] = None  # Stats from the advanced RAG pipeline
+
+
+ class AddDocumentRequest(BaseModel):
+     text: str
+     metadata: Optional[Dict] = None
+
+
+ class AddDocumentResponse(BaseModel):
+     success: bool
+     doc_id: str
+     message: str
+
+
+ class UploadPDFResponse(BaseModel):
+     success: bool
+     document_id: str
+     filename: str
+     chunks_indexed: int
+     message: str
+
+
+ @app.get("/")
+ async def root():
+     """Health check endpoint with comprehensive API documentation"""
+     return {
+         "status": "running",
+         "service": "ChatbotRAG API - Advanced RAG with Multimodal Support",
+         "version": "3.0.0",
+         "vector_db": "Qdrant",
+         "document_db": "MongoDB",
+         "features": {
+             "multiple_inputs": "Index up to 10 texts + 10 images per request",
+             "advanced_rag": "Query expansion, reranking, contextual compression",
+             "pdf_support": "Upload PDFs and chat about their content",
+             "multimodal_pdf": "PDFs with text and image URLs - perfect for user guides",
+             "chat_history": "Track conversation history",
+             "hybrid_search": "Text + image search with Jina CLIP v2"
+         },
+         "endpoints": {
+             "indexing": {
+                 "POST /index": {
+                     "description": "Index multiple texts and images (NEW: up to 10 each)",
+                     "content_type": "multipart/form-data",
+                     "body": {
+                         "id": "string (required) - Document ID",
+                         "texts": "List[string] (optional) - Up to 10 texts",
+                         "images": "List[UploadFile] (optional) - Up to 10 images"
+                     },
+                     "example": "curl -X POST '/index' -F 'id=doc1' -F 'texts=Text 1' -F 'texts=Text 2' -F '[email protected]'",
+                     "response": {
+                         "success": True,
+                         "id": "doc1",
+                         "message": "Indexed successfully with 2 texts and 1 images"
+                     }
+                 },
+                 "POST /documents": {
+                     "description": "Add text document to knowledge base",
+                     "content_type": "application/json",
+                     "body": {
+                         "text": "string (required) - Document content",
+                         "metadata": "object (optional) - Additional metadata"
+                     },
+                     "example": {
+                         "text": "How to create event: Click 'Create Event' button...",
+                         "metadata": {"category": "tutorial", "source": "user_guide"}
+                     }
+                 },
+                 "POST /upload-pdf": {
+                     "description": "Upload PDF file (text only)",
+                     "content_type": "multipart/form-data",
+                     "body": {
+                         "file": "UploadFile (required) - PDF file",
+                         "title": "string (optional) - Document title",
+                         "category": "string (optional) - Category",
+                         "description": "string (optional) - Description"
+                     },
+                     "example": "curl -X POST '/upload-pdf' -F '[email protected]' -F 'title=User Guide'"
+                 },
+                 "POST /upload-pdf-multimodal": {
+                     "description": "Upload PDF with text and image URLs (RECOMMENDED for user guides)",
+                     "content_type": "multipart/form-data",
+                     "features": [
+                         "Extracts text from PDF",
+                         "Detects image URLs (http://, https://)",
+                         "Supports markdown: ![alt](url)",
+                         "Supports HTML: <img src='url'>",
+                         "Links images to text chunks",
+                         "Returns images with context in chat"
+                     ],
+                     "body": {
+                         "file": "UploadFile (required) - PDF file with image URLs",
+                         "title": "string (optional) - Document title",
+                         "category": "string (optional) - e.g. 'user_guide', 'tutorial'",
+                         "description": "string (optional)"
+                     },
+                     "example": "curl -X POST '/upload-pdf-multimodal' -F 'file=@guide_with_images.pdf' -F 'category=user_guide'",
+                     "response": {
+                         "success": True,
+                         "document_id": "pdf_multimodal_20251029_150000",
+                         "chunks_indexed": 25,
+                         "message": "PDF indexed with 25 chunks and 15 images"
+                     },
+                     "use_case": "Perfect for user guides with screenshots, tutorials with diagrams"
+                 }
+             },
+             "search": {
+                 "POST /search": {
+                     "description": "Hybrid search with text and/or image",
+                     "body": {
+                         "text": "string (optional) - Query text",
+                         "image": "UploadFile (optional) - Query image",
+                         "limit": "int (default: 10)",
+                         "score_threshold": "float (optional, 0-1)",
+                         "text_weight": "float (default: 0.5)",
+                         "image_weight": "float (default: 0.5)"
+                     }
+                 },
+                 "POST /search/text": {
+                     "description": "Text-only search",
+                     "body": {"text": "string", "limit": "int", "score_threshold": "float"}
+                 },
+                 "POST /search/image": {
+                     "description": "Image-only search",
+                     "body": {"image": "UploadFile", "limit": "int", "score_threshold": "float"}
+                 },
+                 "POST /rag/search": {
+                     "description": "Search in RAG knowledge base",
+                     "body": {"query": "string", "top_k": "int (default: 5)", "score_threshold": "float (default: 0.5)"}
+                 }
+             },
+             "chat": {
+                 "POST /chat": {
+                     "description": "Chat with Advanced RAG (query expansion + reranking + compression)",
+                     "content_type": "application/json",
+                     "body": {
+                         "message": "string (required) - User question",
+                         "use_rag": "bool (default: true) - Enable RAG retrieval",
+                         "use_advanced_rag": "bool (default: true) - Use advanced RAG pipeline (RECOMMENDED)",
+                         "use_query_expansion": "bool (default: true) - Expand query with variations",
+                         "use_reranking": "bool (default: true) - Rerank results for accuracy",
+                         "use_compression": "bool (default: true) - Compress context to relevant parts",
+                         "top_k": "int (default: 3) - Number of documents to retrieve",
+                         "score_threshold": "float (default: 0.5) - Min relevance score (0-1)",
+                         "max_tokens": "int (default: 512) - Max response tokens",
+                         "temperature": "float (default: 0.7) - Creativity (0-1)",
+                         "hf_token": "string (optional) - Hugging Face token"
+                     },
+                     "response": {
+                         "response": "string - AI answer",
+                         "context_used": "array - Retrieved documents with metadata",
+                         "timestamp": "string",
+                         "rag_stats": "object - RAG pipeline statistics (query variants, retrieval counts)"
+                     },
+                     "example_advanced": {
+                         "message": "Làm sao để upload PDF có hình ảnh?",
+                         "use_advanced_rag": True,
+                         "use_reranking": True,
+                         "top_k": 5,
+                         "score_threshold": 0.5
+                     },
+                     "example_response_with_images": {
+                         "response": "Để upload PDF có hình ảnh, sử dụng endpoint /upload-pdf-multimodal...",
+                         "context_used": [
+                             {
+                                 "id": "pdf_multimodal_...._p2_c1",
+                                 "confidence": 0.89,
+                                 "metadata": {
+                                     "text": "Bước 1: Chuẩn bị PDF với image URLs...",
+                                     "has_images": True,
+                                     "image_urls": [
+                                         "https://example.com/screenshot1.png",
+                                         "https://example.com/diagram.jpg"
+                                     ],
+                                     "num_images": 2,
+                                     "page": 2
+                                 }
+                             }
+                         ],
+                         "rag_stats": {
+                             "original_query": "Làm sao để upload PDF có hình ảnh?",
+                             "expanded_queries": ["upload PDF hình ảnh", "PDF có ảnh"],
+                             "initial_results": 10,
+                             "after_rerank": 5,
+                             "after_compression": 5
+                         }
+                     },
+                     "notes": [
+                         "Advanced RAG significantly improves answer quality",
+                         "When multimodal PDF is used, images are returned in metadata",
+                         "Requires HUGGINGFACE_TOKEN for actual LLM generation"
+                     ]
+                 },
+                 "GET /history": {
+                     "description": "Get chat history",
+                     "query_params": {"limit": "int (default: 10)", "skip": "int (default: 0)"},
+                     "response": {"history": "array", "total": "int"}
+                 }
+             },
+             "management": {
+                 "GET /documents/pdf": {
+                     "description": "List all PDF documents",
+                     "response": {"documents": "array", "total": "int"}
+                 },
+                 "DELETE /documents/pdf/{document_id}": {
+                     "description": "Delete PDF and all its chunks",
+                     "response": {"success": "bool", "message": "string"}
+                 },
+                 "GET /document/{doc_id}": {
+                     "description": "Get document by ID",
+                     "response": {"success": "bool", "data": "object"}
+                 },
+                 "DELETE /delete/{doc_id}": {
+                     "description": "Delete document by ID",
+                     "response": {"success": "bool", "message": "string"}
+                 },
+                 "GET /stats": {
+                     "description": "Get Qdrant collection statistics",
+                     "response": {"vectors_count": "int", "segments": "int", "...": "..."}
+                 }
+             }
+         },
+         "quick_start": {
+             "1_upload_multimodal_pdf": "curl -X POST '/upload-pdf-multimodal' -F 'file=@user_guide.pdf' -F 'title=Guide'",
+             "2_verify_upload": "curl '/documents/pdf'",
+             "3_chat_with_rag": "curl -X POST '/chat' -H 'Content-Type: application/json' -d '{\"message\": \"How to...?\", \"use_advanced_rag\": true}'",
+             "4_see_images_in_context": "response['context_used'][0]['metadata']['image_urls']"
+         },
+         "use_cases": {
+             "user_guide_with_screenshots": {
+                 "endpoint": "/upload-pdf-multimodal",
+                 "description": "PDFs with text instructions + image URLs for visual guidance",
+                 "benefits": ["Images linked to text chunks", "Chatbot returns relevant screenshots", "Perfect for step-by-step guides"]
+             },
+             "simple_text_docs": {
+                 "endpoint": "/upload-pdf",
+                 "description": "Simple PDFs with text only (FAQ, policies, etc.)"
+             },
+             "social_media_posts": {
+                 "endpoint": "/index",
+                 "description": "Index multiple posts with texts (up to 10) and images (up to 10)"
+             },
+             "complex_queries": {
+                 "endpoint": "/chat",
+                 "description": "Use advanced RAG for better accuracy on complex questions",
374
+ "settings": {"use_advanced_rag": True, "use_reranking": True, "use_compression": True}
375
+ }
376
+ },
377
+ "best_practices": {
378
+ "pdf_format": [
379
+ "Include image URLs in text (http://, https://)",
380
+ "Use markdown format: ![alt](url) or HTML: <img src='url'>",
381
+ "Clear structure with headings and sections",
382
+ "Link images close to their related text"
383
+ ],
384
+ "chat_settings": {
385
+ "for_accuracy": {"temperature": 0.3, "use_advanced_rag": True, "use_reranking": True},
386
+ "for_creativity": {"temperature": 0.8, "use_advanced_rag": False},
387
+ "for_factual_answers": {"temperature": 0.3, "use_compression": True, "score_threshold": 0.6}
388
+ },
389
+ "retrieval_tuning": {
390
+ "not_finding_info": "Lower score_threshold to 0.3-0.4, increase top_k to 7-10",
391
+ "too_much_context": "Increase score_threshold to 0.6-0.7, decrease top_k to 3-5",
392
+ "slow_responses": "Disable compression, use basic RAG, decrease top_k"
393
+ }
394
+ },
395
+ "links": {
396
+ "docs": "http://localhost:8000/docs",
397
+ "redoc": "http://localhost:8000/redoc",
398
+ "openapi": "http://localhost:8000/openapi.json",
399
+ "guides": {
400
+ "multimodal_pdf": "See MULTIMODAL_PDF_GUIDE.md",
401
+ "advanced_rag": "See ADVANCED_RAG_GUIDE.md",
402
+ "pdf_general": "See PDF_RAG_GUIDE.md",
403
+ "quick_start": "See QUICK_START_PDF.md"
404
+ }
405
+ },
406
+ "system_info": {
407
+ "embedding_model": "Jina CLIP v2 (multimodal)",
408
+ "vector_db": "Qdrant with HNSW index",
409
+ "document_db": "MongoDB",
410
+ "rag_pipeline": "Advanced RAG with query expansion, reranking, compression",
411
+ "pdf_parser": "pypdfium2 with URL extraction",
412
+ "max_inputs": "10 texts + 10 images per /index request"
413
+ }
414
+ }
415
+
416
+ @app.post("/index", response_model=IndexResponse)
417
+ async def index_data(
418
+ id: str = Form(...),
419
+ texts: Optional[List[str]] = Form(None),
420
+ images: Optional[List[UploadFile]] = File(None)
421
+ ):
422
+ """
423
+ Index data into the vector database (supports multiple texts and images)
+
+ Body:
+ - id: Document ID (event ID, post ID, etc.)
+ - texts: List of text contents (Vietnamese supported) - up to 10 texts
+ - images: List of image files (optional) - up to 10 images
429
+
430
+ Returns:
431
+ - success: True/False
432
+ - id: Document ID
433
+ - message: Status message
434
+ """
435
+ try:
436
+ # Validation
437
+ if texts is None and images is None:
438
+ raise HTTPException(status_code=400, detail="At least one of texts or images must be provided")
439
+
440
+ if texts and len(texts) > 10:
+ raise HTTPException(status_code=400, detail="Maximum 10 texts")
+
+ if images and len(images) > 10:
+ raise HTTPException(status_code=400, detail="Maximum 10 images")
445
+
446
+ # Prepare embeddings
447
+ text_embeddings = []
448
+ image_embeddings = []
449
+
450
+ # Encode multiple texts (Vietnamese supported)
451
+ if texts:
452
+ for text in texts:
453
+ if text and text.strip():
454
+ text_emb = embedding_service.encode_text(text)
455
+ text_embeddings.append(text_emb)
456
+
457
+ # Encode multiple images
458
+ if images:
459
+ for image in images:
460
+ if image.filename: # Check if image is provided
461
+ image_bytes = await image.read()
462
+ pil_image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
463
+ image_emb = embedding_service.encode_image(pil_image)
464
+ image_embeddings.append(image_emb)
465
+
466
+ # Combine embeddings
467
+ all_embeddings = []
468
+
469
+ if text_embeddings:
470
+ # Average all text embeddings
471
+ avg_text_embedding = np.mean(text_embeddings, axis=0)
472
+ all_embeddings.append(avg_text_embedding)
473
+
474
+ if image_embeddings:
475
+ # Average all image embeddings
476
+ avg_image_embedding = np.mean(image_embeddings, axis=0)
477
+ all_embeddings.append(avg_image_embedding)
478
+
479
+ if not all_embeddings:
480
+ raise HTTPException(status_code=400, detail="No embeddings could be created from the provided texts or images")
481
+
482
+ # Final combined embedding
483
+ combined_embedding = np.mean(all_embeddings, axis=0)
484
+
485
+ # Normalize
486
+ combined_embedding = combined_embedding / np.linalg.norm(combined_embedding, axis=-1, keepdims=True)  # axis=-1 handles both 1-D and (1, dim) embeddings
487
+
488
+ # Index into Qdrant
489
+ metadata = {
490
+ "texts": texts if texts else [],
491
+ "text_count": len(texts) if texts else 0,
492
+ "image_count": len(images) if images else 0,
493
+ "image_filenames": [img.filename for img in images] if images else []
494
+ }
495
+
496
+ result = qdrant_service.index_data(
497
+ doc_id=id,
498
+ embedding=combined_embedding,
499
+ metadata=metadata
500
+ )
501
+
502
+ return IndexResponse(
503
+ success=True,
504
+ id=result["original_id"],  # return the MongoDB ObjectId
+ message=f"Successfully indexed document {result['original_id']} with {len(texts) if texts else 0} texts and {len(images) if images else 0} images (Qdrant UUID: {result['qdrant_id']})"
506
+ )
507
+
508
+ except HTTPException:
509
+ raise
510
+ except Exception as e:
511
+ raise HTTPException(status_code=500, detail=f"Indexing error: {str(e)}")
512
+
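The embedding-combination step above (average each modality, then average across modalities and L2-normalize) can be sketched standalone. This is a minimal illustration with NumPy only; it assumes each encoder returns a 1-D float vector, which may differ from the actual `embedding_service` output shape.

```python
import numpy as np

def combine_embeddings(text_embs, image_embs):
    """Average each modality, then average modalities and L2-normalize."""
    parts = []
    if text_embs:
        parts.append(np.mean(text_embs, axis=0))   # mean of all text vectors
    if image_embs:
        parts.append(np.mean(image_embs, axis=0))  # mean of all image vectors
    if not parts:
        raise ValueError("need at least one embedding")
    combined = np.mean(parts, axis=0)              # fuse modalities
    return combined / np.linalg.norm(combined)     # unit-normalize for cosine search

# Two toy 4-dim text embeddings and one image embedding
combined = combine_embeddings(
    [np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])],
    [np.array([0.0, 0.0, 1.0, 0.0])],
)
```

Normalizing the fused vector keeps confidence scores comparable across text-only, image-only, and mixed documents.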
513
+
514
+ @app.post("/search", response_model=List[SearchResponse])
515
+ async def search(
516
+ text: Optional[str] = Form(None),
517
+ image: Optional[UploadFile] = File(None),
518
+ limit: int = Form(10),
519
+ score_threshold: Optional[float] = Form(None),
520
+ text_weight: float = Form(0.5),
521
+ image_weight: float = Form(0.5)
522
+ ):
523
+ """
524
+ Search for similar documents by text and/or image
+
+ Body:
+ - text: Query text (Vietnamese supported)
+ - image: Query image (optional)
+ - limit: Number of results (default: 10)
+ - score_threshold: Minimum confidence score (0-1)
+ - text_weight: Weight for text search (default: 0.5)
+ - image_weight: Weight for image search (default: 0.5)
+
+ Returns:
+ - List of results with id, confidence, and metadata
536
+ """
537
+ try:
538
+ # Prepare query embeddings
539
+ text_embedding = None
540
+ image_embedding = None
541
+
542
+ # Encode text query
543
+ if text and text.strip():
544
+ text_embedding = embedding_service.encode_text(text)
545
+
546
+ # Encode image query
547
+ if image:
548
+ image_bytes = await image.read()
549
+ pil_image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
550
+ image_embedding = embedding_service.encode_image(pil_image)
551
+
552
+ # Validate input
553
+ if text_embedding is None and image_embedding is None:
554
+ raise HTTPException(status_code=400, detail="At least one of text or image must be provided for search")
555
+
556
+ # Hybrid search với Qdrant
557
+ results = qdrant_service.hybrid_search(
558
+ text_embedding=text_embedding,
559
+ image_embedding=image_embedding,
560
+ text_weight=text_weight,
561
+ image_weight=image_weight,
562
+ limit=limit,
563
+ score_threshold=score_threshold,
564
+ ef=256 # High accuracy search
565
+ )
566
+
567
+ # Format response
568
+ return [
569
+ SearchResponse(
570
+ id=result["id"],
571
+ confidence=result["confidence"],
572
+ metadata=result["metadata"]
573
+ )
574
+ for result in results
575
+ ]
576
+
577
+ except HTTPException:
+ raise
+ except Exception as e:
+ raise HTTPException(status_code=500, detail=f"Search error: {str(e)}")
579
+
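A client call to `/search` can be assembled as below. This is a sketch: the helper function and its defaults are illustrative, and the actual `requests.post` call is shown commented out since it needs a running server; the field names match the Form parameters of the endpoint above.

```python
def build_search_request(text=None, image_path=None, limit=10,
                         text_weight=0.5, image_weight=0.5):
    """Build the multipart form payload for POST /search."""
    data = {
        "limit": str(limit),
        "text_weight": str(text_weight),
        "image_weight": str(image_weight),
    }
    if text:
        data["text"] = text
    files = {}
    if image_path:
        files["image"] = open(image_path, "rb")
    return data, files

# Text-weighted hybrid query (no image attached)
data, files = build_search_request(
    text="music event in Hanoi", text_weight=0.7, image_weight=0.3
)
# import requests
# resp = requests.post("http://localhost:8000/search", data=data, files=files)
```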
580
+
581
+ @app.post("/search/text", response_model=List[SearchResponse])
582
+ async def search_by_text(
583
+ text: str = Form(...),
584
+ limit: int = Form(10),
585
+ score_threshold: Optional[float] = Form(None)
586
+ ):
587
+ """
588
+ Search by text only (Vietnamese supported)
+
+ Body:
+ - text: Query text
+ - limit: Number of results
+ - score_threshold: Minimum confidence score
594
+
595
+ Returns:
596
+ - List of results
597
+ """
598
+ try:
599
+ # Encode text
600
+ text_embedding = embedding_service.encode_text(text)
601
+
602
+ # Search
603
+ results = qdrant_service.search(
604
+ query_embedding=text_embedding,
605
+ limit=limit,
606
+ score_threshold=score_threshold,
607
+ ef=256
608
+ )
609
+
610
+ return [
611
+ SearchResponse(
612
+ id=result["id"],
613
+ confidence=result["confidence"],
614
+ metadata=result["metadata"]
615
+ )
616
+ for result in results
617
+ ]
618
+
619
+ except Exception as e:
620
+ raise HTTPException(status_code=500, detail=f"Search error: {str(e)}")
621
+
622
+
623
+ @app.post("/search/image", response_model=List[SearchResponse])
624
+ async def search_by_image(
625
+ image: UploadFile = File(...),
626
+ limit: int = Form(10),
627
+ score_threshold: Optional[float] = Form(None)
628
+ ):
629
+ """
630
+ Search by image only
+
+ Body:
+ - image: Query image
+ - limit: Number of results
635
+ - score_threshold: Minimum confidence score
636
+
637
+ Returns:
638
+ - List of results
639
+ """
640
+ try:
641
+ # Encode image
642
+ image_bytes = await image.read()
643
+ pil_image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
644
+ image_embedding = embedding_service.encode_image(pil_image)
645
+
646
+ # Search
647
+ results = qdrant_service.search(
648
+ query_embedding=image_embedding,
649
+ limit=limit,
650
+ score_threshold=score_threshold,
651
+ ef=256
652
+ )
653
+
654
+ return [
655
+ SearchResponse(
656
+ id=result["id"],
657
+ confidence=result["confidence"],
658
+ metadata=result["metadata"]
659
+ )
660
+ for result in results
661
+ ]
662
+
663
+ except Exception as e:
664
+ raise HTTPException(status_code=500, detail=f"Search error: {str(e)}")
665
+
666
+
667
+ @app.delete("/delete/{doc_id}")
668
+ async def delete_document(doc_id: str):
669
+ """
670
+ Delete document by ID (MongoDB ObjectId or UUID)
671
+
672
+ Args:
673
+ - doc_id: Document ID to delete
674
+
675
+ Returns:
676
+ - Success message
677
+ """
678
+ try:
679
+ qdrant_service.delete_by_id(doc_id)
680
+ return {"success": True, "message": f"Deleted document {doc_id}"}
+ except Exception as e:
+ raise HTTPException(status_code=500, detail=f"Delete error: {str(e)}")
683
+
684
+
685
+ @app.get("/document/{doc_id}")
686
+ async def get_document(doc_id: str):
687
+ """
688
+ Get document by ID (MongoDB ObjectId or UUID)
689
+
690
+ Args:
691
+ - doc_id: Document ID (MongoDB ObjectId)
692
+
693
+ Returns:
694
+ - Document data
695
+ """
696
+ try:
697
+ doc = qdrant_service.get_by_id(doc_id)
698
+ if doc:
699
+ return {
700
+ "success": True,
701
+ "data": doc
702
+ }
703
+ raise HTTPException(status_code=404, detail=f"Document {doc_id} not found")
704
+ except HTTPException:
705
+ raise
706
+ except Exception as e:
707
+ raise HTTPException(status_code=500, detail=f"Error getting document: {str(e)}")
708
+
709
+
710
+ @app.get("/stats")
711
+ async def get_stats():
712
+ """
713
+ Get collection statistics
714
+
715
+ Returns:
716
+ - Collection statistics
717
+ """
718
+ try:
719
+ info = qdrant_service.get_collection_info()
720
+ return info
721
+ except Exception as e:
722
+ raise HTTPException(status_code=500, detail=f"Error getting stats: {str(e)}")
723
+
724
+
725
+ # ============================================
726
+ # ChatbotRAG Endpoints
727
+ # ============================================
728
+
729
+ @app.post("/chat", response_model=ChatResponse)
730
+ async def chat(request: ChatRequest):
731
+ """
732
+ Chat endpoint with Advanced RAG
733
+
734
+ Body:
735
+ - message: User message
736
+ - use_rag: Enable RAG retrieval (default: true)
737
+ - top_k: Number of documents to retrieve (default: 3)
738
+ - system_message: System prompt (optional)
739
+ - max_tokens: Max tokens for response (default: 512)
740
+ - temperature: Temperature for generation (default: 0.7)
741
+ - hf_token: Hugging Face token (optional; falls back to the HUGGINGFACE_TOKEN env var if not provided)
742
+ - use_advanced_rag: Use advanced RAG pipeline (default: true)
743
+ - use_query_expansion: Enable query expansion (default: true)
744
+ - use_reranking: Enable reranking (default: true)
745
+ - use_compression: Enable context compression (default: true)
746
+ - score_threshold: Minimum relevance score (default: 0.5)
747
+
748
+ Returns:
749
+ - response: Generated response
750
+ - context_used: Retrieved context documents
751
+ - timestamp: Response timestamp
752
+ - rag_stats: Statistics from RAG pipeline
753
+ """
754
+ try:
755
+ # Retrieve context if RAG enabled
756
+ context_used = []
757
+ rag_stats = None
758
+
759
+ if request.use_rag:
760
+ if request.use_advanced_rag:
761
+ # Use Advanced RAG Pipeline
762
+ documents, stats = advanced_rag.hybrid_rag_pipeline(
763
+ query=request.message,
764
+ top_k=request.top_k,
765
+ score_threshold=request.score_threshold,
766
+ use_reranking=request.use_reranking,
767
+ use_compression=request.use_compression,
768
+ max_context_tokens=500
769
+ )
770
+
771
+ # Convert to dict format for compatibility
772
+ context_used = [
773
+ {
774
+ "id": doc.id,
775
+ "confidence": doc.confidence,
776
+ "metadata": doc.metadata
777
+ }
778
+ for doc in documents
779
+ ]
780
+ rag_stats = stats
781
+
782
+ # Format context using advanced RAG formatter
783
+ context_text = advanced_rag.format_context_for_llm(documents)
784
+
785
+ else:
786
+ # Use basic RAG (original implementation)
787
+ query_embedding = embedding_service.encode_text(request.message)
788
+
789
+ results = qdrant_service.search(
790
+ query_embedding=query_embedding,
791
+ limit=request.top_k,
792
+ score_threshold=request.score_threshold
793
+ )
794
+ context_used = results
795
+
796
+ # Build context text (basic format)
797
+ context_text = "\n\nRelevant Context:\n"
798
+ for i, doc in enumerate(context_used, 1):
799
+ doc_text = doc["metadata"].get("text", "")
800
+ confidence = doc["confidence"]
801
+ context_text += f"\n[{i}] (Confidence: {confidence:.2f})\n{doc_text}\n"
802
+
803
+ # Build system message with context
804
+ if request.use_rag and context_used:
805
+ if request.use_advanced_rag:
806
+ # Use advanced prompt builder
807
+ system_message = advanced_rag.build_rag_prompt(
808
+ query=request.message,
809
+ context=context_text,
810
+ system_message=request.system_message
811
+ )
812
+ else:
813
+ # Basic prompt
814
+ system_message = f"{request.system_message}\n{context_text}\n\nPlease use the above context to answer the user's question when relevant."
815
+ else:
816
+ system_message = request.system_message
817
+
818
+ # Use token from request or fallback to env
819
+ token = request.hf_token or hf_token
820
+ # Generate response
821
+ if not token:
822
+ response = f"""[LLM Response Placeholder]
823
+
824
+ Context retrieved: {len(context_used)} documents
825
+ User question: {request.message}
826
+
827
+ To enable actual LLM generation:
828
+ 1. Set HUGGINGFACE_TOKEN environment variable, OR
829
+ 2. Pass hf_token in request body
830
+
831
+ Example:
832
+ {{
833
+ "message": "Your question",
834
+ "hf_token": "hf_xxxxxxxxxxxxx"
835
+ }}
836
+ """
837
+ else:
838
+ try:
839
+ client = InferenceClient(
840
+ token=token,  # use the request-provided token when given, else the env fallback
841
+ model="openai/gpt-oss-20b"
842
+ )
843
+
844
+ # Build messages
845
+ messages = [
846
+ {"role": "system", "content": system_message},
847
+ {"role": "user", "content": request.message}
848
+ ]
849
+
850
+ # Generate response
851
+ response = ""
852
+ for msg in client.chat_completion(
853
+ messages,
854
+ max_tokens=request.max_tokens,
855
+ stream=True,
856
+ temperature=request.temperature,
857
+ top_p=request.top_p,
858
+ ):
859
+ choices = msg.choices
860
+ if len(choices) and choices[0].delta.content:
861
+ response += choices[0].delta.content
862
+
863
+ except Exception as e:
864
+ response = f"Error generating response with LLM: {str(e)}\n\nContext was retrieved successfully, but LLM generation failed."
865
+
866
+ # Save to history
867
+ chat_data = {
868
+ "user_message": request.message,
869
+ "assistant_response": response,
870
+ "context_used": context_used,
871
+ "timestamp": datetime.utcnow()
872
+ }
873
+ chat_history_collection.insert_one(chat_data)
874
+
875
+ return ChatResponse(
876
+ response=response,
877
+ context_used=context_used,
878
+ timestamp=datetime.utcnow().isoformat(),
879
+ rag_stats=rag_stats
880
+ )
881
+
882
+ except Exception as e:
883
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
884
+
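A full advanced-RAG chat request body can be assembled like this. The flag names and defaults follow the ChatRequest fields documented in the endpoint above; the helper function itself is illustrative, and the network call is left commented out.

```python
def build_chat_payload(message, **overrides):
    """Assemble a /chat request body with advanced-RAG defaults."""
    payload = {
        "message": message,
        "use_rag": True,
        "use_advanced_rag": True,
        "use_query_expansion": True,
        "use_reranking": True,
        "use_compression": True,
        "top_k": 3,
        "score_threshold": 0.5,
        "max_tokens": 512,
        "temperature": 0.7,
    }
    payload.update(overrides)  # per-request tuning, e.g. lower temperature for factual answers
    return payload

body = build_chat_payload("How do I upload a PDF with images?", top_k=5, temperature=0.3)
# import requests
# resp = requests.post("http://localhost:8000/chat", json=body)
```

Lower `temperature` plus reranking is the combination the best-practices section recommends for accuracy-sensitive queries.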
885
+
886
+ @app.post("/documents", response_model=AddDocumentResponse)
887
+ async def add_document(request: AddDocumentRequest):
888
+ """
889
+ Add document to knowledge base
890
+
891
+ Body:
892
+ - text: Document text
893
+ - metadata: Additional metadata (optional)
894
+
895
+ Returns:
896
+ - success: True/False
897
+ - doc_id: MongoDB document ID
898
+ - message: Status message
899
+ """
900
+ try:
901
+ # Save to MongoDB
902
+ doc_data = {
903
+ "text": request.text,
904
+ "metadata": request.metadata or {},
905
+ "created_at": datetime.utcnow()
906
+ }
907
+ result = documents_collection.insert_one(doc_data)
908
+ doc_id = str(result.inserted_id)
909
+
910
+ # Generate embedding
911
+ embedding = embedding_service.encode_text(request.text)
912
+
913
+ # Index to Qdrant
914
+ qdrant_service.index_data(
915
+ doc_id=doc_id,
916
+ embedding=embedding,
917
+ metadata={
918
+ "text": request.text,
919
+ "source": "api",
920
+ **(request.metadata or {})
921
+ }
922
+ )
923
+
924
+ return AddDocumentResponse(
925
+ success=True,
926
+ doc_id=doc_id,
927
+ message=f"Document added successfully with ID: {doc_id}"
928
+ )
929
+
930
+ except Exception as e:
931
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
932
+
933
+
934
+ @app.post("/rag/search", response_model=List[SearchResponse])
935
+ async def rag_search(
936
+ query: str = Form(...),
937
+ top_k: int = Form(5),
938
+ score_threshold: Optional[float] = Form(0.5)
939
+ ):
940
+ """
941
+ Search in knowledge base
942
+
943
+ Body:
944
+ - query: Search query
945
+ - top_k: Number of results (default: 5)
946
+ - score_threshold: Minimum score (default: 0.5)
947
+
948
+ Returns:
949
+ - results: List of matching documents
950
+ """
951
+ try:
952
+ # Generate query embedding
953
+ query_embedding = embedding_service.encode_text(query)
954
+
955
+ # Search in Qdrant
956
+ results = qdrant_service.search(
957
+ query_embedding=query_embedding,
958
+ limit=top_k,
959
+ score_threshold=score_threshold
960
+ )
961
+
962
+ return [
963
+ SearchResponse(
964
+ id=result["id"],
965
+ confidence=result["confidence"],
966
+ metadata=result["metadata"]
967
+ )
968
+ for result in results
969
+ ]
970
+
971
+ except Exception as e:
972
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
973
+
974
+
975
+ @app.get("/history")
976
+ async def get_history(limit: int = 10, skip: int = 0):
977
+ """
978
+ Get chat history
979
+
980
+ Query params:
981
+ - limit: Number of messages to return (default: 10)
982
+ - skip: Number of messages to skip (default: 0)
983
+
984
+ Returns:
985
+ - history: List of chat messages
986
+ """
987
+ try:
988
+ history = list(
989
+ chat_history_collection
990
+ .find({}, {"_id": 0})
991
+ .sort("timestamp", -1)
992
+ .skip(skip)
993
+ .limit(limit)
994
+ )
995
+
996
+ # Convert datetime to string
997
+ for msg in history:
998
+ if "timestamp" in msg:
999
+ msg["timestamp"] = msg["timestamp"].isoformat()
1000
+
1001
+ return {
1002
+ "history": history,
1003
+ "total": chat_history_collection.count_documents({})
1004
+ }
1005
+
1006
+ except Exception as e:
1007
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
1008
+
1009
+
1010
+ @app.delete("/documents/{doc_id}")
1011
+ async def delete_document_from_kb(doc_id: str):
1012
+ """
1013
+ Delete document from knowledge base
1014
+
1015
+ Args:
1016
+ - doc_id: Document ID (MongoDB ObjectId)
1017
+
1018
+ Returns:
1019
+ - success: True/False
1020
+ - message: Status message
1021
+ """
1022
+ try:
1023
+ # Delete from MongoDB
1024
+ from bson import ObjectId
+ result = documents_collection.delete_one({"_id": ObjectId(doc_id)})  # _id is stored as ObjectId, not str
1025
+
1026
+ # Delete from Qdrant
1027
+ if result.deleted_count > 0:
1028
+ qdrant_service.delete_by_id(doc_id)
1029
+ return {"success": True, "message": f"Document {doc_id} deleted from knowledge base"}
1030
+ else:
1031
+ raise HTTPException(status_code=404, detail=f"Document {doc_id} not found")
1032
+
1033
+ except HTTPException:
1034
+ raise
1035
+ except Exception as e:
1036
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
1037
+
1038
+
1039
+ @app.post("/upload-pdf", response_model=UploadPDFResponse)
1040
+ async def upload_pdf(
1041
+ file: UploadFile = File(...),
1042
+ document_id: Optional[str] = Form(None),
1043
+ title: Optional[str] = Form(None),
1044
+ description: Optional[str] = Form(None),
1045
+ category: Optional[str] = Form(None)
1046
+ ):
1047
+ """
1048
+ Upload and index PDF file into knowledge base
1049
+
1050
+ Body (multipart/form-data):
1051
+ - file: PDF file (required)
1052
+ - document_id: Custom document ID (optional, auto-generated if not provided)
1053
+ - title: Document title (optional)
1054
+ - description: Document description (optional)
1055
+ - category: Document category (optional, e.g., "user_guide", "faq")
1056
+
1057
+ Returns:
1058
+ - success: True/False
1059
+ - document_id: Document ID
1060
+ - filename: Original filename
1061
+ - chunks_indexed: Number of chunks created
1062
+ - message: Status message
1063
+
1064
+ Example:
1065
+ ```bash
1066
+ curl -X POST "http://localhost:8000/upload-pdf" \
1067
+ -F "file=@user_guide.pdf" \
1068
+ -F "title=Hướng dẫn sử dụng ChatbotRAG" \
1069
+ -F "category=user_guide"
1070
+ ```
1071
+ """
1072
+ try:
1073
+ # Validate file type
1074
+ if not file.filename.lower().endswith('.pdf'):
1075
+ raise HTTPException(status_code=400, detail="Only PDF files are allowed")
1076
+
1077
+ # Generate document ID if not provided
1078
+ if not document_id:
1079
+ from datetime import datetime
1080
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
1081
+ document_id = f"pdf_{timestamp}"
1082
+
1083
+ # Read PDF bytes
1084
+ pdf_bytes = await file.read()
1085
+
1086
+ # Prepare metadata
1087
+ metadata = {}
1088
+ if title:
1089
+ metadata['title'] = title
1090
+ if description:
1091
+ metadata['description'] = description
1092
+ if category:
1093
+ metadata['category'] = category
1094
+
1095
+ # Index PDF
1096
+ result = pdf_indexer.index_pdf_bytes(
1097
+ pdf_bytes=pdf_bytes,
1098
+ document_id=document_id,
1099
+ filename=file.filename,
1100
+ document_metadata=metadata
1101
+ )
1102
+
1103
+ return UploadPDFResponse(
1104
+ success=True,
1105
+ document_id=result['document_id'],
1106
+ filename=result['filename'],
1107
+ chunks_indexed=result['chunks_indexed'],
1108
+ message=f"PDF '{file.filename}' indexed successfully with {result['chunks_indexed']} chunks"
1109
+ )
1110
+
1111
+ except HTTPException:
1112
+ raise
1113
+ except Exception as e:
1114
+ raise HTTPException(status_code=500, detail=f"Error uploading PDF: {str(e)}")
1115
+
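The curl example in the docstring can be mirrored in Python. This is a sketch: the helper name is illustrative, the PDF bytes are a placeholder, and the `requests.post` call is commented out since it needs a running server.

```python
def build_pdf_upload(pdf_bytes, filename, title=None, category=None):
    """Build the multipart payload for POST /upload-pdf."""
    files = {"file": (filename, pdf_bytes, "application/pdf")}
    data = {}
    if title:
        data["title"] = title
    if category:
        data["category"] = category
    return data, files

data, files = build_pdf_upload(
    b"%PDF-1.4", "user_guide.pdf",
    title="ChatbotRAG user guide", category="user_guide",
)
# import requests
# resp = requests.post("http://localhost:8000/upload-pdf", data=data, files=files)
```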
1116
+
1117
+ @app.get("/documents/pdf")
1118
+ async def list_pdf_documents():
1119
+ """
1120
+ List all PDF documents in knowledge base
1121
+
1122
+ Returns:
1123
+ - documents: List of PDF documents with metadata
1124
+ """
1125
+ try:
1126
+ docs = list(documents_collection.find(
1127
+ {"type": "pdf"},
1128
+ {"_id": 0}
1129
+ ))
1130
+ return {"documents": docs, "total": len(docs)}
1131
+ except Exception as e:
1132
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
1133
+
1134
+
1135
+ @app.delete("/documents/pdf/{document_id}")
1136
+ async def delete_pdf_document(document_id: str):
1137
+ """
1138
+ Delete PDF document and all its chunks from knowledge base
1139
+
1140
+ Args:
1141
+ - document_id: Document ID
1142
+
1143
+ Returns:
1144
+ - success: True/False
1145
+ - message: Status message
1146
+ """
1147
+ try:
1148
+ # Get document info
1149
+ doc = documents_collection.find_one({"document_id": document_id, "type": "pdf"})
1150
+
1151
+ if not doc:
1152
+ raise HTTPException(status_code=404, detail=f"PDF document {document_id} not found")
1153
+
1154
+ # Delete all chunks from Qdrant
1155
+ chunk_ids = doc.get('chunk_ids', [])
1156
+ for chunk_id in chunk_ids:
1157
+ try:
1158
+ qdrant_service.delete_by_id(chunk_id)
1159
+ except Exception:
1160
+ pass # Chunk might already be deleted
1161
+
1162
+ # Delete from MongoDB
1163
+ documents_collection.delete_one({"document_id": document_id})
1164
+
1165
+ return {
1166
+ "success": True,
1167
+ "message": f"PDF document {document_id} and {len(chunk_ids)} chunks deleted"
1168
+ }
1169
+
1170
+ except HTTPException:
1171
+ raise
1172
+ except Exception as e:
1173
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
1174
+
1175
+
1176
+ @app.post("/upload-pdf-multimodal", response_model=UploadPDFResponse)
1177
+ async def upload_pdf_multimodal(
1178
+ file: UploadFile = File(...),
1179
+ document_id: Optional[str] = Form(None),
1180
+ title: Optional[str] = Form(None),
1181
+ description: Optional[str] = Form(None),
1182
+ category: Optional[str] = Form(None)
1183
+ ):
1184
+ """
1185
+ Upload PDF with text and image URLs (for user guides with screenshots)
1186
+
1187
+ This endpoint is optimized for PDFs containing:
1188
+ - Text instructions
1189
+ - Image URLs (http://... or https://...)
1190
+ - Markdown images: ![alt](url)
1191
+ - HTML images: <img src="url">
1192
+
1193
+ The system will:
1194
+ 1. Extract text from PDF
1195
+ 2. Detect all image URLs in the text
1196
+ 3. Link images to their corresponding text chunks
1197
+ 4. Store image URLs in metadata
1198
+ 5. Return images along with text during chat
1199
+
1200
+ Body (multipart/form-data):
1201
+ - file: PDF file (required)
1202
+ - document_id: Custom document ID (optional, auto-generated if not provided)
1203
+ - title: Document title (optional)
1204
+ - description: Document description (optional)
1205
+ - category: Document category (optional, e.g., "user_guide", "tutorial")
1206
+
1207
+ Returns:
1208
+ - success: True/False
1209
+ - document_id: Document ID
1210
+ - filename: Original filename
1211
+ - chunks_indexed: Number of chunks created
1212
+ - message: Status message (includes image count)
1213
+
1214
+ Example:
1215
+ ```bash
1216
+ curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
1217
+ -F "file=@user_guide_with_images.pdf" \
1218
+ -F "title=Hướng dẫn có ảnh minh họa" \
1219
+ -F "category=user_guide"
1220
+ ```
1221
+
1222
+ Example Response:
1223
+ ```json
1224
+ {
1225
+ "success": true,
1226
+ "document_id": "pdf_20251029_150000",
1227
+ "filename": "user_guide_with_images.pdf",
1228
+ "chunks_indexed": 25,
1229
+ "message": "PDF 'user_guide_with_images.pdf' indexed with 25 chunks and 15 images"
1230
+ }
1231
+ ```
1232
+ """
1233
+ try:
1234
+ # Validate file type
1235
+ if not file.filename.lower().endswith('.pdf'):
1236
+ raise HTTPException(status_code=400, detail="Only PDF files are allowed")
1237
+
1238
+ # Generate document ID if not provided
1239
+ if not document_id:
1240
+ from datetime import datetime
1241
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
1242
+ document_id = f"pdf_multimodal_{timestamp}"
1243
+
1244
+ # Read PDF bytes
1245
+ pdf_bytes = await file.read()
1246
+
1247
+ # Prepare metadata
1248
+ metadata = {'type': 'multimodal'}
1249
+ if title:
1250
+ metadata['title'] = title
1251
+ if description:
1252
+ metadata['description'] = description
1253
+ if category:
1254
+ metadata['category'] = category
1255
+
1256
+ # Index PDF with multimodal parser
1257
+ result = multimodal_pdf_indexer.index_pdf_bytes(
1258
+ pdf_bytes=pdf_bytes,
1259
+ document_id=document_id,
1260
+ filename=file.filename,
1261
+ document_metadata=metadata
1262
+ )
1263
+
1264
+ return UploadPDFResponse(
1265
+ success=True,
1266
+ document_id=result['document_id'],
1267
+ filename=result['filename'],
1268
+ chunks_indexed=result['chunks_indexed'],
1269
+ message=f"PDF '{file.filename}' indexed successfully with {result['chunks_indexed']} chunks and {result.get('images_found', 0)} images"
1270
+ )
1271
+
1272
+ except HTTPException:
1273
+ raise
1274
+ except Exception as e:
1275
+ raise HTTPException(status_code=500, detail=f"Error uploading multimodal PDF: {str(e)}")
1276
+
1277
+
1278
+ if __name__ == "__main__":
1279
+ import uvicorn
1280
+ uvicorn.run(
1281
+ app,
1282
+ host="0.0.0.0",
1283
+ port=8000,
1284
+ log_level="info"
1285
+ )
multimodal_pdf_parser.py ADDED
@@ -0,0 +1,390 @@
+ """
+ Enhanced Multimodal PDF Parser for PDFs with Text + Image URLs
+ Extracts text, detects image URLs, and links them together
+ """
+
+ import pypdfium2 as pdfium
+ from typing import List, Dict, Optional, Tuple
+ import re
+ from dataclasses import dataclass, field
+
+
+ @dataclass
+ class MultimodalChunk:
+     """Represents a chunk with text and associated images"""
+     text: str
+     page_number: int
+     chunk_index: int
+     image_urls: List[str] = field(default_factory=list)
+     metadata: Dict = field(default_factory=dict)
+
+
+ class MultimodalPDFParser:
+     """
+     Enhanced PDF Parser that extracts text and image URLs
+     Well suited to user guides with screenshots and visual instructions
+     """
+
+     def __init__(
+         self,
+         chunk_size: int = 500,
+         chunk_overlap: int = 50,
+         min_chunk_size: int = 50,
+         extract_images: bool = True
+     ):
+         self.chunk_size = chunk_size
+         self.chunk_overlap = chunk_overlap
+         self.min_chunk_size = min_chunk_size
+         self.extract_images = extract_images
+
+         # URL patterns
+         self.url_patterns = [
+             # Standard URLs
+             r'https?://[^\s<>"{}|\\^`\[\]]+',
+             # Markdown images: ![alt](url)
+             r'!\[.*?\]\((https?://[^\s)]+)\)',
+             # HTML images: <img src="url">
+             r'<img[^>]+src=["\']([^"\']+)["\']',
+             # Direct image extensions
+             r'https?://[^\s<>"{}|\\^`\[\]]+\.(?:jpg|jpeg|png|gif|bmp|svg|webp)',
+         ]
+
+     def extract_image_urls(self, text: str) -> List[str]:
+         """
+         Extract all image URLs from text
+
+         Args:
+             text: Text content
+
+         Returns:
+             List of image URLs found
+         """
+         urls = []
+
+         for pattern in self.url_patterns:
+             matches = re.findall(pattern, text, re.IGNORECASE)
+             urls.extend(matches)
+
+         # Remove duplicates while preserving order
+         seen = set()
+         unique_urls = []
+         for url in urls:
+             if url not in seen:
+                 seen.add(url)
+                 unique_urls.append(url)
+
+         return unique_urls
+
+     def extract_text_from_pdf(self, pdf_path: str) -> Dict[int, Tuple[str, List[str]]]:
+         """
+         Extract text and image URLs from PDF
+
+         Args:
+             pdf_path: Path to PDF file
+
+         Returns:
+             Dictionary mapping page number to (text, image_urls) tuple
+         """
+         pdf_pages = {}
+
+         try:
+             pdf = pdfium.PdfDocument(pdf_path)
+
+             for page_num in range(len(pdf)):
+                 page = pdf[page_num]
+                 textpage = page.get_textpage()
+                 text = textpage.get_text_range()
+
+                 # Clean text
+                 text = self._clean_text(text)
+
+                 # Extract image URLs if enabled
+                 image_urls = []
+                 if self.extract_images:
+                     image_urls = self.extract_image_urls(text)
+
+                 pdf_pages[page_num + 1] = (text, image_urls)
+
+             return pdf_pages
+
+         except Exception as e:
+             raise Exception(f"Error reading PDF: {str(e)}")
+
+     def _clean_text(self, text: str) -> str:
+         """Clean extracted text"""
+         # Collapse excessive whitespace
+         text = re.sub(r'\s+', ' ', text)
+         # Remove NUL characters
+         text = text.replace('\x00', '')
+         return text.strip()
+
+     def chunk_text_with_images(
+         self,
+         text: str,
+         image_urls: List[str],
+         page_number: int
+     ) -> List[MultimodalChunk]:
+         """
+         Split text into chunks and associate images with relevant chunks
+
+         Args:
+             text: Text to chunk
+             image_urls: Image URLs from the page
+             page_number: Page number
+
+         Returns:
+             List of MultimodalChunk objects
+         """
+         # Split into words
+         words = text.split()
+
+         if len(words) < self.min_chunk_size:
+             if len(words) > 0:
+                 return [MultimodalChunk(
+                     text=text,
+                     page_number=page_number,
+                     chunk_index=0,
+                     image_urls=image_urls,  # All images go to the single chunk
+                     metadata={'page': page_number, 'chunk': 0}
+                 )]
+             return []
+
+         chunks = []
+         chunk_index = 0
+         start = 0
+
+         # Calculate how to distribute images across chunks
+         images_per_chunk = len(image_urls) // max(1, len(words) // self.chunk_size) if image_urls else 0
+         image_index = 0
+
+         while start < len(words):
+             end = min(start + self.chunk_size, len(words))
+             chunk_words = words[start:end]
+             chunk_text = ' '.join(chunk_words)
+
+             # Assign images to this chunk
+             chunk_images = []
+             if image_urls:
+                 # Simple strategy: keep a URL with the chunk whose text contains it
+                 for url in image_urls:
+                     if url in chunk_text:
+                         chunk_images.append(url)
+
+                 # If no URLs appear in the chunk text, distribute the remainder evenly
+                 if not chunk_images and image_index < len(image_urls):
+                     num_imgs = min(images_per_chunk + 1, len(image_urls) - image_index)
+                     chunk_images = image_urls[image_index:image_index + num_imgs]
+                     image_index += num_imgs
+
+             chunks.append(MultimodalChunk(
+                 text=chunk_text,
+                 page_number=page_number,
+                 chunk_index=chunk_index,
+                 image_urls=chunk_images,
+                 metadata={
+                     'page': page_number,
+                     'chunk': chunk_index,
+                     'start_word': start,
+                     'end_word': end,
+                     'has_images': len(chunk_images) > 0,
+                     'num_images': len(chunk_images)
+                 }
+             ))
+
+             chunk_index += 1
+             start = end - self.chunk_overlap
+
+             if start >= len(words) - self.min_chunk_size:
+                 break
+
+         return chunks
+
+     def parse_pdf(
+         self,
+         pdf_path: str,
+         document_metadata: Optional[Dict] = None
+     ) -> List[MultimodalChunk]:
+         """
+         Parse PDF into multimodal chunks
+
+         Args:
+             pdf_path: Path to PDF file
+             document_metadata: Additional metadata
+
+         Returns:
+             List of MultimodalChunk objects
+         """
+         pages_data = self.extract_text_from_pdf(pdf_path)
+
+         all_chunks = []
+         for page_num, (text, image_urls) in pages_data.items():
+             chunks = self.chunk_text_with_images(text, image_urls, page_num)
+
+             # Add document metadata
+             if document_metadata:
+                 for chunk in chunks:
+                     chunk.metadata.update(document_metadata)
+
+             all_chunks.extend(chunks)
+
+         return all_chunks
+
+     def parse_pdf_bytes(
+         self,
+         pdf_bytes: bytes,
+         document_metadata: Optional[Dict] = None
+     ) -> List[MultimodalChunk]:
+         """Parse PDF from bytes"""
+         import tempfile
+         import os
+
+         with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
+             tmp.write(pdf_bytes)
+             tmp_path = tmp.name
+
+         try:
+             chunks = self.parse_pdf(tmp_path, document_metadata)
+             return chunks
+         finally:
+             if os.path.exists(tmp_path):
+                 os.unlink(tmp_path)
+
+
+ class MultimodalPDFIndexer:
+     """Index multimodal PDF chunks into the RAG system"""
+
+     def __init__(self, embedding_service, qdrant_service, documents_collection):
+         self.embedding_service = embedding_service
+         self.qdrant_service = qdrant_service
+         self.documents_collection = documents_collection
+         self.parser = MultimodalPDFParser()
+
+     def index_pdf(
+         self,
+         pdf_path: str,
+         document_id: str,
+         document_metadata: Optional[Dict] = None
+     ) -> Dict:
+         """Index PDF with image URLs"""
+         chunks = self.parser.parse_pdf(pdf_path, document_metadata)
+
+         indexed_count = 0
+         chunk_ids = []
+         total_images = 0
+
+         for chunk in chunks:
+             chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+
+             # Generate embedding (text-based)
+             embedding = self.embedding_service.encode_text(chunk.text)
+
+             # Prepare metadata with image URLs
+             metadata = {
+                 'text': chunk.text,
+                 'document_id': document_id,
+                 'page': chunk.page_number,
+                 'chunk_index': chunk.chunk_index,
+                 'source': 'pdf',
+                 'has_images': len(chunk.image_urls) > 0,
+                 'image_urls': chunk.image_urls,  # Store image URLs!
+                 'num_images': len(chunk.image_urls),
+                 **chunk.metadata
+             }
+
+             # Index to Qdrant
+             self.qdrant_service.index_data(
+                 doc_id=chunk_id,
+                 embedding=embedding,
+                 metadata=metadata
+             )
+
+             chunk_ids.append(chunk_id)
+             indexed_count += 1
+             total_images += len(chunk.image_urls)
+
+         # Save document info
+         doc_info = {
+             'document_id': document_id,
+             'type': 'multimodal_pdf',
+             'file_path': pdf_path,
+             'num_chunks': indexed_count,
+             'total_images': total_images,
+             'chunk_ids': chunk_ids,
+             'metadata': document_metadata or {}
+         }
+         self.documents_collection.insert_one(doc_info)
+
+         return {
+             'success': True,
+             'document_id': document_id,
+             'chunks_indexed': indexed_count,
+             'images_found': total_images,
+             'chunk_ids': chunk_ids[:5]
+         }
+
+     def index_pdf_bytes(
+         self,
+         pdf_bytes: bytes,
+         document_id: str,
+         filename: str,
+         document_metadata: Optional[Dict] = None
+     ) -> Dict:
+         """Index PDF from bytes"""
+         doc_metadata = document_metadata or {}
+         doc_metadata['filename'] = filename
+
+         chunks = self.parser.parse_pdf_bytes(pdf_bytes, doc_metadata)
+
+         indexed_count = 0
+         chunk_ids = []
+         total_images = 0
+
+         for chunk in chunks:
+             chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+
+             embedding = self.embedding_service.encode_text(chunk.text)
+
+             # Per-chunk payload; kept separate from doc_metadata so the
+             # document record below is not overwritten by the last chunk
+             chunk_metadata = {
+                 'text': chunk.text,
+                 'document_id': document_id,
+                 'page': chunk.page_number,
+                 'chunk_index': chunk.chunk_index,
+                 'source': 'multimodal_pdf',
+                 'filename': filename,
+                 'has_images': len(chunk.image_urls) > 0,
+                 'image_urls': chunk.image_urls,
+                 'num_images': len(chunk.image_urls),
+                 **chunk.metadata
+             }
+
+             self.qdrant_service.index_data(
+                 doc_id=chunk_id,
+                 embedding=embedding,
+                 metadata=chunk_metadata
+             )
+
+             chunk_ids.append(chunk_id)
+             indexed_count += 1
+             total_images += len(chunk.image_urls)
+
+         doc_info = {
+             'document_id': document_id,
+             'type': 'multimodal_pdf',
+             'filename': filename,
+             'num_chunks': indexed_count,
+             'total_images': total_images,
+             'chunk_ids': chunk_ids,
+             'metadata': doc_metadata
+         }
+         self.documents_collection.insert_one(doc_info)
+
+         return {
+             'success': True,
+             'document_id': document_id,
+             'filename': filename,
+             'chunks_indexed': indexed_count,
+             'images_found': total_images,
+             'chunk_ids': chunk_ids[:5]
+         }
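
As a sanity check, the URL-extraction logic above can be exercised on its own. This is a minimal sketch that copies the regex patterns from `MultimodalPDFParser.__init__` into a standalone function (it does not import the module itself, and the sample URLs are made up for illustration):

```python
import re

# Patterns mirroring MultimodalPDFParser.url_patterns
url_patterns = [
    r'https?://[^\s<>"{}|\\^`\[\]]+',                                      # standard URLs
    r'!\[.*?\]\((https?://[^\s)]+)\)',                                     # Markdown images
    r'<img[^>]+src=["\']([^"\']+)["\']',                                   # HTML <img> tags
    r'https?://[^\s<>"{}|\\^`\[\]]+\.(?:jpg|jpeg|png|gif|bmp|svg|webp)',   # image extensions
]

def extract_image_urls(text: str) -> list:
    """Collect unique URLs in order of first appearance."""
    urls = []
    for pattern in url_patterns:
        urls.extend(re.findall(pattern, text, re.IGNORECASE))
    seen, unique_urls = set(), []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique_urls.append(url)
    return unique_urls

sample = 'Docs at https://example.com/b.jpg and <img src="https://example.com/c.png">'
print(extract_image_urls(sample))
```

Note that a URL found by several patterns (e.g. one that is both a bare link and an `<img src>`) is reported only once, since the dedup pass keeps the first occurrence.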
pdf_parser.py ADDED
@@ -0,0 +1,371 @@
+ """
+ PDF Parser Service for RAG Chatbot
+ Extracts text from PDF and splits into chunks for indexing
+ """
+
+ import pypdfium2 as pdfium
+ from typing import List, Dict, Optional
+ import re
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class PDFChunk:
+     """Represents a chunk of text from PDF"""
+     text: str
+     page_number: int
+     chunk_index: int
+     metadata: Dict
+
+
+ class PDFParser:
+     """Parse PDF files and prepare for RAG indexing"""
+
+     def __init__(
+         self,
+         chunk_size: int = 500,    # words per chunk
+         chunk_overlap: int = 50,  # words of overlap between chunks
+         min_chunk_size: int = 50  # minimum words in a chunk
+     ):
+         self.chunk_size = chunk_size
+         self.chunk_overlap = chunk_overlap
+         self.min_chunk_size = min_chunk_size
+
+     def extract_text_from_pdf(self, pdf_path: str) -> Dict[int, str]:
+         """
+         Extract text from a PDF file
+
+         Args:
+             pdf_path: Path to PDF file
+
+         Returns:
+             Dictionary mapping page number to text content
+         """
+         pdf_text = {}
+
+         try:
+             pdf = pdfium.PdfDocument(pdf_path)
+
+             for page_num in range(len(pdf)):
+                 page = pdf[page_num]
+                 textpage = page.get_textpage()
+                 text = textpage.get_text_range()
+
+                 # Clean text
+                 text = self._clean_text(text)
+                 pdf_text[page_num + 1] = text  # 1-indexed pages
+
+             return pdf_text
+
+         except Exception as e:
+             raise Exception(f"Error reading PDF: {str(e)}")
+
+     def _clean_text(self, text: str) -> str:
+         """Clean extracted text"""
+         # Collapse excessive whitespace
+         text = re.sub(r'\s+', ' ', text)
+
+         # Remove special characters that might cause issues
+         text = text.replace('\x00', '')
+
+         return text.strip()
+
+     def chunk_text(self, text: str, page_number: int) -> List[PDFChunk]:
+         """
+         Split text into overlapping chunks
+
+         Args:
+             text: Text to chunk
+             page_number: Page number this text came from
+
+         Returns:
+             List of PDFChunk objects
+         """
+         # Split into words
+         words = text.split()
+
+         if len(words) < self.min_chunk_size:
+             # Text too short, return as a single chunk
+             if len(words) > 0:
+                 return [PDFChunk(
+                     text=text,
+                     page_number=page_number,
+                     chunk_index=0,
+                     metadata={'page': page_number, 'chunk': 0}
+                 )]
+             return []
+
+         chunks = []
+         chunk_index = 0
+         start = 0
+
+         while start < len(words):
+             # Get chunk
+             end = min(start + self.chunk_size, len(words))
+             chunk_words = words[start:end]
+             chunk_text = ' '.join(chunk_words)
+
+             chunks.append(PDFChunk(
+                 text=chunk_text,
+                 page_number=page_number,
+                 chunk_index=chunk_index,
+                 metadata={
+                     'page': page_number,
+                     'chunk': chunk_index,
+                     'start_word': start,
+                     'end_word': end
+                 }
+             ))
+
+             chunk_index += 1
+
+             # Move start position back by the overlap
+             start = end - self.chunk_overlap
+
+             # Avoid an infinite loop / tiny tail chunk
+             if start >= len(words) - self.min_chunk_size:
+                 break
+
+         return chunks
+
+     def parse_pdf(
+         self,
+         pdf_path: str,
+         document_metadata: Optional[Dict] = None
+     ) -> List[PDFChunk]:
+         """
+         Parse an entire PDF into chunks
+
+         Args:
+             pdf_path: Path to PDF file
+             document_metadata: Additional metadata for the document
+
+         Returns:
+             List of all chunks from the PDF
+         """
+         # Extract text from all pages
+         pages_text = self.extract_text_from_pdf(pdf_path)
+
+         # Chunk each page
+         all_chunks = []
+         for page_num, text in pages_text.items():
+             chunks = self.chunk_text(text, page_num)
+
+             # Add document metadata
+             if document_metadata:
+                 for chunk in chunks:
+                     chunk.metadata.update(document_metadata)
+
+             all_chunks.extend(chunks)
+
+         return all_chunks
+
+     def parse_pdf_bytes(
+         self,
+         pdf_bytes: bytes,
+         document_metadata: Optional[Dict] = None
+     ) -> List[PDFChunk]:
+         """
+         Parse a PDF from bytes (for uploaded files)
+
+         Args:
+             pdf_bytes: PDF file as bytes
+             document_metadata: Additional metadata
+
+         Returns:
+             List of chunks
+         """
+         import tempfile
+         import os
+
+         # Save to a temp file
+         with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
+             tmp.write(pdf_bytes)
+             tmp_path = tmp.name
+
+         try:
+             chunks = self.parse_pdf(tmp_path, document_metadata)
+             return chunks
+         finally:
+             # Clean up the temp file
+             if os.path.exists(tmp_path):
+                 os.unlink(tmp_path)
+
+     def get_pdf_info(self, pdf_path: str) -> Dict:
+         """
+         Get basic info about a PDF
+
+         Args:
+             pdf_path: Path to PDF file
+
+         Returns:
+             Dictionary with PDF information
+         """
+         try:
+             pdf = pdfium.PdfDocument(pdf_path)
+
+             info = {
+                 'num_pages': len(pdf),
+                 'file_path': pdf_path,
+             }
+
+             return info
+
+         except Exception as e:
+             raise Exception(f"Error reading PDF info: {str(e)}")
+
+
+ class PDFIndexer:
+     """Index PDF chunks into the RAG system"""
+
+     def __init__(self, embedding_service, qdrant_service, documents_collection):
+         self.embedding_service = embedding_service
+         self.qdrant_service = qdrant_service
+         self.documents_collection = documents_collection
+         self.parser = PDFParser()
+
+     def index_pdf(
+         self,
+         pdf_path: str,
+         document_id: str,
+         document_metadata: Optional[Dict] = None
+     ) -> Dict:
+         """
+         Index an entire PDF into the RAG system
+
+         Args:
+             pdf_path: Path to PDF file
+             document_id: Unique ID for this document
+             document_metadata: Additional metadata (title, author, etc.)
+
+         Returns:
+             Indexing results
+         """
+         # Parse PDF
+         chunks = self.parser.parse_pdf(pdf_path, document_metadata)
+
+         # Index each chunk
+         indexed_count = 0
+         chunk_ids = []
+
+         for chunk in chunks:
+             # Generate a unique ID for the chunk
+             chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+
+             # Generate embedding
+             embedding = self.embedding_service.encode_text(chunk.text)
+
+             # Prepare metadata
+             metadata = {
+                 'text': chunk.text,
+                 'document_id': document_id,
+                 'page': chunk.page_number,
+                 'chunk_index': chunk.chunk_index,
+                 'source': 'pdf',
+                 **chunk.metadata
+             }
+
+             # Index to Qdrant
+             self.qdrant_service.index_data(
+                 doc_id=chunk_id,
+                 embedding=embedding,
+                 metadata=metadata
+             )
+
+             chunk_ids.append(chunk_id)
+             indexed_count += 1
+
+         # Save document info to MongoDB
+         doc_info = {
+             'document_id': document_id,
+             'type': 'pdf',
+             'file_path': pdf_path,
+             'num_chunks': indexed_count,
+             'chunk_ids': chunk_ids,
+             'metadata': document_metadata or {},
+             'pdf_info': self.parser.get_pdf_info(pdf_path)
+         }
+         self.documents_collection.insert_one(doc_info)
+
+         return {
+             'success': True,
+             'document_id': document_id,
+             'chunks_indexed': indexed_count,
+             'chunk_ids': chunk_ids[:5]  # Return first 5 as a sample
+         }
+
+     def index_pdf_bytes(
+         self,
+         pdf_bytes: bytes,
+         document_id: str,
+         filename: str,
+         document_metadata: Optional[Dict] = None
+     ) -> Dict:
+         """
+         Index a PDF from bytes (for uploaded files)
+
+         Args:
+             pdf_bytes: PDF file as bytes
+             document_id: Unique ID for this document
+             filename: Original filename
+             document_metadata: Additional metadata
+
+         Returns:
+             Indexing results
+         """
+         # Parse PDF
+         doc_metadata = document_metadata or {}
+         doc_metadata['filename'] = filename
+
+         chunks = self.parser.parse_pdf_bytes(pdf_bytes, doc_metadata)
+
+         # Index each chunk
+         indexed_count = 0
+         chunk_ids = []
+
+         for chunk in chunks:
+             # Generate a unique ID for the chunk
+             chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+
+             # Generate embedding
+             embedding = self.embedding_service.encode_text(chunk.text)
+
+             # Per-chunk payload; kept separate from doc_metadata so the
+             # document record below keeps the document-level metadata
+             chunk_metadata = {
+                 'text': chunk.text,
+                 'document_id': document_id,
+                 'page': chunk.page_number,
+                 'chunk_index': chunk.chunk_index,
+                 'source': 'pdf',
+                 'filename': filename,
+                 **chunk.metadata
+             }
+
+             # Index to Qdrant
+             self.qdrant_service.index_data(
+                 doc_id=chunk_id,
+                 embedding=embedding,
+                 metadata=chunk_metadata
+             )
+
+             chunk_ids.append(chunk_id)
+             indexed_count += 1
+
+         # Save document info to MongoDB
+         doc_info = {
+             'document_id': document_id,
+             'type': 'pdf',
+             'filename': filename,
+             'num_chunks': indexed_count,
+             'chunk_ids': chunk_ids,
+             'metadata': doc_metadata
+         }
+         self.documents_collection.insert_one(doc_info)
+
+         return {
+             'success': True,
+             'document_id': document_id,
+             'filename': filename,
+             'chunks_indexed': indexed_count,
+             'chunk_ids': chunk_ids[:5]
+         }
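
The overlapping word-window loop used by `PDFParser.chunk_text` can be illustrated in isolation. A minimal sketch with small, hypothetical sizes (the real defaults are 500/50/50):

```python
def chunk_words(words, chunk_size=10, overlap=3, min_chunk=3):
    """Yield overlapping word windows, mirroring PDFParser.chunk_text."""
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(' '.join(words[start:end]))
        start = end - overlap  # step back by the overlap
        if start >= len(words) - min_chunk:
            break              # stop before emitting a tiny tail chunk
    return chunks

words = [f"w{i}" for i in range(25)]
chunks = chunk_words(words)
print(len(chunks))
```

With 25 words, a window of 10 and an overlap of 3, the effective step is 7 words, so the windows start at 0, 7, 14, and 21, and each chunk repeats the last 3 words of its predecessor.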
qdrant_service.py ADDED
@@ -0,0 +1,447 @@
1
+ from qdrant_client import QdrantClient
2
+ from qdrant_client.models import (
3
+ Distance, VectorParams, PointStruct,
4
+ SearchRequest, SearchParams, HnswConfigDiff,
5
+ OptimizersConfigDiff, ScalarQuantization,
6
+ ScalarQuantizationConfig, ScalarType,
7
+ QuantizationSearchParams
8
+ )
9
+ from typing import List, Dict, Any, Optional
10
+ import numpy as np
11
+ import uuid
12
+ import os
13
+
14
+
15
+ class QdrantVectorService:
16
+ """
17
+ Qdrant Cloud Vector Database Service với cấu hình tối ưu
18
+ - HNSW algorithm với parameters mạnh mẽ nhất
19
+ - Scalar Quantization để tối ưu memory và speed
20
+ - Hỗ trợ hybrid search (text + image)
21
+ """
22
+
23
+ def __init__(
24
+ self,
25
+ url: Optional[str] = None,
26
+ api_key: Optional[str] = None,
27
+ collection_name: str = "event_social_media",
28
+ vector_size: int = 1024, # Jina CLIP v2 dimension
29
+ ):
30
+ """
31
+ Initialize Qdrant Cloud client
32
+
33
+ Args:
34
+ url: Qdrant Cloud URL (từ env hoặc truyền vào)
35
+ api_key: Qdrant API key (từ env hoặc truyền vào)
36
+ collection_name: Tên collection
37
+ vector_size: Dimension của vectors (1024 cho Jina CLIP v2)
38
+ """
39
+ # Lấy credentials từ env nếu không truyền vào
40
+ self.url = url or os.getenv("QDRANT_URL")
41
+ self.api_key = api_key or os.getenv("QDRANT_API_KEY")
42
+
43
+ if not self.url or not self.api_key:
44
+ raise ValueError("Cần cung cấp QDRANT_URL và QDRANT_API_KEY (qua env hoặc params)")
45
+
46
+ print(f"Connecting to Qdrant Cloud...")
47
+
48
+ # Initialize Qdrant Cloud client
49
+ self.client = QdrantClient(
50
+ url=self.url,
51
+ api_key=self.api_key,
52
+ )
53
+
54
+ self.collection_name = collection_name
55
+ self.vector_size = vector_size
56
+
57
+ # Create collection nếu chưa tồn tại
58
+ self._ensure_collection()
59
+
60
+ print(f"✓ Connected to Qdrant collection: {collection_name}")
61
+
62
+ def _ensure_collection(self):
63
+ """
64
+ Tạo collection với HNSW config tối ưu nhất
65
+ """
66
+ # Check nếu collection đã tồn tại
67
+ collections = self.client.get_collections().collections
68
+ collection_exists = any(c.name == self.collection_name for c in collections)
69
+
70
+ if not collection_exists:
71
+ print(f"Creating collection {self.collection_name} with optimal HNSW config...")
72
+
73
+ self.client.create_collection(
74
+ collection_name=self.collection_name,
75
+ vectors_config=VectorParams(
76
+ size=self.vector_size,
77
+ distance=Distance.COSINE, # Cosine similarity cho embeddings
78
+ hnsw_config=HnswConfigDiff(
79
+ m=64, # Số edges per node - cao nhất cho accuracy
80
+ ef_construct=512, # Search range khi build index - cao cho quality
81
+ full_scan_threshold=10000, # Threshold để switch sang full scan
82
+ max_indexing_threads=0, # Auto-detect số threads
83
+ on_disk=False, # Keep trong RAM cho speed (nếu đủ memory)
84
+ )
85
+ ),
86
+ optimizers_config=OptimizersConfigDiff(
87
+ deleted_threshold=0.2,
88
+ vacuum_min_vector_number=1000,
89
+ default_segment_number=2,
90
+ max_segment_size=200000,
91
+ memmap_threshold=50000,
92
+ indexing_threshold=10000,
93
+ flush_interval_sec=5,
94
+ max_optimization_threads=0, # Auto-detect
95
+ ),
96
+ # Sử dụng Scalar Quantization để tối ưu memory và speed
97
+ quantization_config=ScalarQuantization(
98
+ scalar=ScalarQuantizationConfig(
99
+ type=ScalarType.INT8,
100
+ quantile=0.99,
101
+ always_ram=True, # Keep quantized vectors trong RAM
102
+ )
103
+ )
104
+ )
105
+ print("✓ Collection created with optimal configuration")
106
+ else:
107
+ print("✓ Collection already exists")
108
+
109
+ def _convert_to_valid_id(self, doc_id: str) -> str:
110
+ """
111
+ Convert bất kỳ string ID nào thành UUID hợp lệ cho Qdrant
112
+
113
+ Args:
114
+ doc_id: Original ID (có thể là MongoDB ObjectId, string, etc.)
115
+
116
+ Returns:
117
+ UUID string hợp lệ
118
+ """
119
+ if not doc_id:
120
+ return str(uuid.uuid4())
121
+
122
+ # Nếu đã là UUID hợp lệ, giữ nguyên
123
+ try:
124
+ uuid.UUID(doc_id)
125
+ return doc_id
126
+ except ValueError:
127
+ pass
128
+
129
+ # Convert string sang UUID deterministic (cùng input = cùng UUID)
130
+ # Sử dụng UUID v5 với namespace DNS
131
+ return str(uuid.uuid5(uuid.NAMESPACE_DNS, doc_id))
132
+
133
+ def index_data(
134
+ self,
135
+ doc_id: str,
136
+ embedding: np.ndarray,
137
+ metadata: Dict[str, Any]
138
+ ) -> Dict[str, str]:
139
+ """
140
+ Index data vào Qdrant
141
+
142
+ Args:
143
+ doc_id: ID của document (MongoDB ObjectId, string, etc.)
144
+ embedding: Vector embedding từ Jina CLIP
145
+ metadata: Metadata (text, image_url, event_info, etc.)
146
+
147
+ Returns:
148
+ Dict với original_id và qdrant_id
149
+ """
150
+ # Convert ID thành UUID hợp lệ
151
+ qdrant_id = self._convert_to_valid_id(doc_id)
152
+
153
+ # Lưu original ID vào metadata
154
+ metadata['original_id'] = doc_id
155
+
156
+ # Ensure embedding là 1D array
157
+ if len(embedding.shape) > 1:
158
+ embedding = embedding.flatten()
159
+
160
+ # Create point
161
+ point = PointStruct(
162
+ id=qdrant_id,
163
+ vector=embedding.tolist(),
164
+ payload=metadata
165
+ )
166
+
167
+ # Upsert vào collection
168
+ self.client.upsert(
169
+ collection_name=self.collection_name,
170
+ points=[point]
171
+ )
172
+
173
+ return {
174
+ "original_id": doc_id,
175
+ "qdrant_id": qdrant_id
176
+ }
177
+
178
+ def batch_index(
179
+ self,
180
+ doc_ids: List[str],
181
+ embeddings: np.ndarray,
182
+ metadata_list: List[Dict[str, Any]]
183
+ ) -> List[Dict[str, str]]:
184
+ """
185
+ Batch index nhiều documents cùng lúc
186
+
187
+ Args:
188
+ doc_ids: List of document IDs (MongoDB ObjectId, string, etc.)
189
+ embeddings: Numpy array of embeddings (n_samples, embedding_dim)
190
+ metadata_list: List of metadata dicts
191
+
192
+ Returns:
193
+ List of dicts với original_id và qdrant_id
194
+ """
195
+ points = []
196
+ id_mappings = []
197
+
198
+ for i, (doc_id, embedding, metadata) in enumerate(zip(doc_ids, embeddings, metadata_list)):
199
+ # Convert to valid UUID
200
+ qdrant_id = self._convert_to_valid_id(doc_id)
201
+
202
+ # Lưu original ID vào metadata
203
+ metadata['original_id'] = doc_id
204
+
205
+ # Ensure embedding là 1D
206
+ if len(embedding.shape) > 1:
207
+ embedding = embedding.flatten()
208
+
209
+ points.append(PointStruct(
210
+ id=qdrant_id,
211
+ vector=embedding.tolist(),
212
+ payload=metadata
213
+ ))
214
+
215
+ id_mappings.append({
216
+ "original_id": doc_id,
217
+ "qdrant_id": qdrant_id
218
+ })
219
+
220
+ # Batch upsert
221
+ self.client.upsert(
222
+ collection_name=self.collection_name,
223
+ points=points,
224
+ wait=True # Wait for indexing to complete
225
+ )
226
+
227
+ return id_mappings
228
+
229
+ def search(
230
+ self,
231
+ query_embedding: np.ndarray,
232
+ limit: int = 10,
233
+ score_threshold: Optional[float] = None,
234
+ filter_conditions: Optional[Dict] = None,
235
+ ef: int = 256 # Search quality parameter - cao hơn = accurate hơn
236
+ ) -> List[Dict[str, Any]]:
237
+ """
238
+ Search similar vectors trong Qdrant
239
+
240
+ Args:
241
+ query_embedding: Query embedding từ Jina CLIP
242
+ limit: Số lượng results trả về
243
+ score_threshold: Minimum similarity score (0-1)
244
+ filter_conditions: Qdrant filter conditions
245
+ ef: HNSW search parameter (128-512, cao hơn = accurate hơn)
246
+
247
+ Returns:
248
+ List of search results với id, score, và metadata
249
+ """
250
+ # Ensure query embedding là 1D
251
+ if len(query_embedding.shape) > 1:
252
+ query_embedding = query_embedding.flatten()
253
+
254
+ # Search với HNSW parameters tối ưu
255
+ search_result = self.client.search(
256
+ collection_name=self.collection_name,
257
+ query_vector=query_embedding.tolist(),
258
+ limit=limit,
259
+ score_threshold=score_threshold,
260
+ query_filter=filter_conditions,
261
+ search_params=SearchParams(
262
+ hnsw_ef=ef, # Higher ef = more accurate search
263
+ exact=False, # Use HNSW (not exact search)
264
+ quantization=QuantizationSearchParams(
265
+ ignore=False, # Use quantization
266
+ rescore=True, # Rescore với original vectors
267
+ oversampling=2.0 # Oversample factor
268
+ )
269
+ ),
270
+ with_payload=True,
271
+ with_vectors=False # Không cần return vectors
272
+ )
273
+
274
+ # Format results - trả về original_id thay vì UUID
275
+ results = []
276
+ for hit in search_result:
277
+ # Lấy original_id từ metadata (MongoDB ObjectId)
278
+ original_id = hit.payload.get('original_id', hit.id)
279
+
280
+ results.append({
281
+ "id": original_id, # Trả về MongoDB ObjectId
282
+ "qdrant_id": hit.id, # UUID trong Qdrant
283
+ "confidence": float(hit.score), # Cosine similarity score
284
+ "metadata": hit.payload
285
+ })
286
+
287
+ return results
288
+
289
+ def hybrid_search(
290
+ self,
291
+ text_embedding: Optional[np.ndarray] = None,
292
+ image_embedding: Optional[np.ndarray] = None,
293
+ text_weight: float = 0.5,
294
+ image_weight: float = 0.5,
295
+ limit: int = 10,
296
+ score_threshold: Optional[float] = None,
297
+ ef: int = 256
298
+ ) -> List[Dict[str, Any]]:
299
+ """
300
+ Hybrid search với cả text và image embeddings
301
+
302
+ Args:
303
+ text_embedding: Text query embedding
304
+ image_embedding: Image query embedding
305
+ text_weight: Weight cho text search (0-1)
306
+ image_weight: Weight cho image search (0-1)
307
+ limit: Số results
308
+ score_threshold: Minimum score
309
+             ef: HNSW search parameter
+
+         Returns:
+             Combined search results
+         """
+         # Combine embeddings with the given weights
+         combined_embedding = np.zeros(self.vector_size)
+
+         if text_embedding is not None:
+             if len(text_embedding.shape) > 1:
+                 text_embedding = text_embedding.flatten()
+             combined_embedding += text_weight * text_embedding
+
+         if image_embedding is not None:
+             if len(image_embedding.shape) > 1:
+                 image_embedding = image_embedding.flatten()
+             combined_embedding += image_weight * image_embedding
+
+         # Normalize the combined embedding
+         norm = np.linalg.norm(combined_embedding)
+         if norm > 0:
+             combined_embedding = combined_embedding / norm
+
+         # Search with the combined embedding
+         return self.search(
+             query_embedding=combined_embedding,
+             limit=limit,
+             score_threshold=score_threshold,
+             ef=ef
+         )
+
+     def delete_by_id(self, doc_id: str) -> bool:
+         """
+         Delete document by ID (supports both MongoDB ObjectId and UUID)
+
+         Args:
+             doc_id: Document ID to delete (MongoDB ObjectId or UUID)
+
+         Returns:
+             Success status
+         """
+         # Convert to UUID if this is a MongoDB ObjectId
+         qdrant_id = self._convert_to_valid_id(doc_id)
+
+         self.client.delete(
+             collection_name=self.collection_name,
+             points_selector=[qdrant_id]
+         )
+         return True
+
+     def get_by_id(self, doc_id: str) -> Optional[Dict[str, Any]]:
+         """
+         Get document by ID (supports both MongoDB ObjectId and UUID)
+
+         Args:
+             doc_id: Document ID (MongoDB ObjectId or UUID)
+
+         Returns:
+             Document data, or None if not found
+         """
+         # Convert to UUID if this is a MongoDB ObjectId
+         qdrant_id = self._convert_to_valid_id(doc_id)
+
+         try:
+             result = self.client.retrieve(
+                 collection_name=self.collection_name,
+                 ids=[qdrant_id],
+                 with_payload=True,
+                 with_vectors=False
+             )
+
+             if result:
+                 point = result[0]
+                 original_id = point.payload.get('original_id', point.id)
+                 return {
+                     "id": original_id,      # MongoDB ObjectId
+                     "qdrant_id": point.id,  # UUID inside Qdrant
+                     "metadata": point.payload
+                 }
+             return None
+         except Exception as e:
+             print(f"Error retrieving document: {e}")
+             return None
+
+     def search_by_metadata(
+         self,
+         filter_conditions: Dict,
+         limit: int = 100
+     ) -> List[Dict[str, Any]]:
+         """
+         Search documents by metadata conditions (no embedding required)
+
+         Args:
+             filter_conditions: Qdrant filter conditions
+             limit: Maximum number of results
+
+         Returns:
+             List of matching documents
+         """
+         try:
+             result = self.client.scroll(
+                 collection_name=self.collection_name,
+                 scroll_filter=filter_conditions,
+                 limit=limit,
+                 with_payload=True,
+                 with_vectors=False
+             )
+
+             documents = []
+             for point in result[0]:  # result is a tuple (points, next_page_offset)
+                 original_id = point.payload.get('original_id', point.id)
+                 documents.append({
+                     "id": original_id,      # MongoDB ObjectId
+                     "qdrant_id": point.id,  # UUID inside Qdrant
+                     "metadata": point.payload
+                 })
+
+             return documents
+         except Exception as e:
+             print(f"Error searching by metadata: {e}")
+             return []
+
+     def get_collection_info(self) -> Dict[str, Any]:
+         """
+         Get collection information
+
+         Returns:
+             Collection info
+         """
+         info = self.client.get_collection(collection_name=self.collection_name)
+         return {
+             "vectors_count": info.vectors_count,
+             "points_count": info.points_count,
+             "status": info.status,
+             "config": {
+                 "distance": info.config.params.vectors.distance,
+                 "size": info.config.params.vectors.size,
+             }
+         }
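The combined-embedding search in `qdrant_service.py` boils down to a weighted sum of the modality embeddings followed by L2 normalization, so that the result can be compared with cosine distance. A minimal standalone sketch of that fusion step (the `fuse_embeddings` name and the fixed vector size are illustrative, not part of the service API):

```python
import numpy as np

def fuse_embeddings(text_emb=None, image_emb=None,
                    text_weight=0.5, image_weight=0.5, size=4):
    """Weighted sum of modality embeddings, then L2-normalize.

    Mirrors the fusion step in the Qdrant service; helper name and
    `size` parameter are illustrative only.
    """
    combined = np.zeros(size)
    if text_emb is not None:
        combined += text_weight * np.asarray(text_emb, dtype=float).flatten()
    if image_emb is not None:
        combined += image_weight * np.asarray(image_emb, dtype=float).flatten()
    norm = np.linalg.norm(combined)
    # Guard against the all-zero case (no embeddings supplied)
    return combined / norm if norm > 0 else combined

# Two orthogonal unit vectors fused with equal weights still yield a unit vector
fused = fuse_embeddings(text_emb=[1.0, 0.0, 0.0, 0.0],
                        image_emb=[0.0, 1.0, 0.0, 0.0])
```

The normalization matters: without it, documents retrieved with both modalities would score systematically differently from single-modality queries.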
requirements.txt ADDED
@@ -0,0 +1,34 @@
+ # FastAPI and web framework
+ fastapi==0.115.5
+ uvicorn[standard]==0.32.1
+ python-multipart==0.0.20
+
+ # Gradio for Hugging Face Spaces
+ gradio>=4.0.0
+
+ # Machine Learning & Embeddings
+ torch>=2.0.0
+ transformers>=4.50.0
+ onnxruntime==1.20.1
+ torchvision>=0.15.0
+ pillow>=10.0.0
+ numpy>=1.24.0
+
+ # Vector Database
+ qdrant-client>=1.12.1
+ grpcio>=1.60.0
+
+ # Utilities
+ pydantic>=2.0.0
+ python-dotenv==1.0.0
+
+ # MongoDB
+ pymongo>=4.6.0
+ huggingface-hub>=0.20.0
+ timm
+ einops
+
+ # PDF Processing
+ pypdfium2>=4.30.0
test_advanced_features.py ADDED
@@ -0,0 +1,260 @@
+ """
+ Test script for Advanced RAG features
+ Demonstrates new capabilities: multiple texts/images indexing and advanced RAG chat
+ """
+
+ import requests
+ from typing import List, Optional
+
+
+ class AdvancedRAGTester:
+     """Test client for the Advanced RAG API"""
+
+     def __init__(self, base_url: str = "http://localhost:8000"):
+         self.base_url = base_url
+
+     def test_multiple_index(self, doc_id: str, texts: List[str], image_paths: Optional[List[str]] = None):
+         """
+         Test indexing with multiple texts and images
+
+         Args:
+             doc_id: Document ID
+             texts: List of texts (max 10)
+             image_paths: List of image file paths (max 10)
+         """
+         print(f"\n{'='*60}")
+         print(f"TEST: Indexing document '{doc_id}' with multiple texts/images")
+         print(f"{'='*60}")
+
+         # Prepare form data
+         data = {'id': doc_id}
+
+         # Add texts
+         if texts:
+             if len(texts) > 10:
+                 print("WARNING: Maximum 10 texts allowed. Taking first 10.")
+                 texts = texts[:10]
+             data['texts'] = texts
+             print(f"✓ Texts: {len(texts)} items")
+
+         # Prepare files
+         files = []
+         if image_paths:
+             if len(image_paths) > 10:
+                 print("WARNING: Maximum 10 images allowed. Taking first 10.")
+                 image_paths = image_paths[:10]
+
+             for img_path in image_paths:
+                 try:
+                     files.append(('images', open(img_path, 'rb')))
+                 except FileNotFoundError:
+                     print(f"WARNING: Image not found: {img_path}")
+
+             print(f"✓ Images: {len(files)} files")
+
+         # Make request
+         try:
+             response = requests.post(f"{self.base_url}/index", data=data, files=files)
+             response.raise_for_status()
+
+             result = response.json()
+             print(f"\n✓ SUCCESS")
+             print(f"  - Document ID: {result['id']}")
+             print(f"  - Message: {result['message']}")
+             return result
+
+         except requests.exceptions.RequestException as e:
+             print(f"\n✗ ERROR: {e}")
+             if e.response is not None:
+                 print(f"  Response: {e.response.text}")
+             return None
+
+         finally:
+             # Close file handles
+             for _, file_obj in files:
+                 file_obj.close()
+
+     def test_advanced_rag_chat(
+         self,
+         message: str,
+         hf_token: Optional[str] = None,
+         use_advanced_rag: bool = True,
+         use_reranking: bool = True,
+         use_compression: bool = True,
+         top_k: int = 3,
+         score_threshold: float = 0.5
+     ):
+         """
+         Test advanced RAG chat
+
+         Args:
+             message: User question
+             hf_token: Hugging Face token (optional)
+             use_advanced_rag: Use advanced RAG pipeline
+             use_reranking: Enable reranking
+             use_compression: Enable context compression
+             top_k: Number of documents to retrieve
+             score_threshold: Minimum relevance score
+         """
+         print(f"\n{'='*60}")
+         print(f"TEST: Advanced RAG Chat")
+         print(f"{'='*60}")
+         print(f"Question: {message}")
+         print(f"Advanced RAG: {use_advanced_rag}")
+         print(f"Reranking: {use_reranking}")
+         print(f"Compression: {use_compression}")
+
+         payload = {
+             'message': message,
+             'use_rag': True,
+             'use_advanced_rag': use_advanced_rag,
+             'use_reranking': use_reranking,
+             'use_compression': use_compression,
+             'top_k': top_k,
+             'score_threshold': score_threshold,
+         }
+
+         if hf_token:
+             payload['hf_token'] = hf_token
+
+         try:
+             response = requests.post(f"{self.base_url}/chat", json=payload)
+             response.raise_for_status()
+
+             result = response.json()
+
+             print(f"\n✓ SUCCESS")
+             print(f"\n--- Answer ---")
+             print(result['response'])
+
+             print(f"\n--- Retrieved Context ({len(result['context_used'])} documents) ---")
+             for i, ctx in enumerate(result['context_used'], 1):
+                 print(f"{i}. [{ctx['id']}] Confidence: {ctx['confidence']:.2%}")
+                 text_preview = ctx['metadata'].get('text', '')[:100]
+                 print(f"   Text: {text_preview}...")
+
+             if result.get('rag_stats'):
+                 print(f"\n--- RAG Pipeline Statistics ---")
+                 stats = result['rag_stats']
+                 print(f"  Original query: {stats.get('original_query')}")
+                 print(f"  Expanded queries: {stats.get('expanded_queries')}")
+                 print(f"  Initial results: {stats.get('initial_results')}")
+                 print(f"  After reranking: {stats.get('after_rerank')}")
+                 print(f"  After compression: {stats.get('after_compression')}")
+
+             return result
+
+         except requests.exceptions.RequestException as e:
+             print(f"\n✗ ERROR: {e}")
+             if e.response is not None:
+                 print(f"  Response: {e.response.text}")
+             return None
+
+     def compare_basic_vs_advanced_rag(self, message: str, hf_token: Optional[str] = None):
+         """Compare basic RAG vs advanced RAG side by side"""
+         print(f"\n{'='*60}")
+         print(f"COMPARISON: Basic RAG vs Advanced RAG")
+         print(f"{'='*60}")
+         print(f"Question: {message}\n")
+
+         # Test basic RAG
+         print("\n--- BASIC RAG ---")
+         basic_result = self.test_advanced_rag_chat(
+             message=message,
+             hf_token=hf_token,
+             use_advanced_rag=False
+         )
+
+         # Test advanced RAG
+         print("\n--- ADVANCED RAG ---")
+         advanced_result = self.test_advanced_rag_chat(
+             message=message,
+             hf_token=hf_token,
+             use_advanced_rag=True
+         )
+
+         # Compare
+         print(f"\n{'='*60}")
+         print("COMPARISON SUMMARY")
+         print(f"{'='*60}")
+
+         if basic_result and advanced_result:
+             print(f"Basic RAG:")
+             print(f"  - Retrieved docs: {len(basic_result['context_used'])}")
+
+             print(f"\nAdvanced RAG:")
+             print(f"  - Retrieved docs: {len(advanced_result['context_used'])}")
+             if advanced_result.get('rag_stats'):
+                 stats = advanced_result['rag_stats']
+                 print(f"  - Query expansion: {len(stats.get('expanded_queries', []))} variants")
+                 print(f"  - Initial retrieval: {stats.get('initial_results', 0)} docs")
+                 print(f"  - After reranking: {stats.get('after_rerank', 0)} docs")
+
+
+ def main():
+     """Run tests"""
+     tester = AdvancedRAGTester()
+
+     print("="*60)
+     print("ADVANCED RAG FEATURE TESTS")
+     print("="*60)
+
+     # Test 1: Index with multiple texts (no images for demo)
+     print("\n\n### TEST 1: Index Multiple Texts ###")
+     tester.test_multiple_index(
+         doc_id="event_music_festival_2025",
+         texts=[
+             "Festival âm nhạc quốc tế Hà Nội 2025",
+             "Thời gian: 15-17 tháng 11 năm 2025",
+             "Địa điểm: Công viên Thống Nhất, Hà Nội",
+             "Line-up: Sơn Tùng MTP, Đen Vâu, Hoàng Thùy Linh, Mỹ Tâm",
+             "Giá vé: Early bird 500.000đ, VIP 2.000.000đ",
+             "Dự kiến 50.000 khán giả tham dự",
+             "3 sân khấu chính, 5 food court, khu vực cắm trại"
+         ]
+     )
+
+     # Test 2: Index another document
+     print("\n\n### TEST 2: Index Another Document ###")
+     tester.test_multiple_index(
+         doc_id="safety_guidelines",
+         texts=[
+             "Vũ khí và đồ vật nguy hiểm bị cấm mang vào sự kiện",
+             "Dao, kiếm, súng và các loại vũ khí nguy hiểm nghiêm cấm",
+             "An ninh sẽ kiểm tra tất cả túi xách và đồ mang theo",
+             "Vi phạm sẽ bị tịch thu và có thể bị trục xuất khỏi sự kiện"
+         ]
+     )
+
+     # Test 3: Basic chat (without HF token - will show placeholder)
+     print("\n\n### TEST 3: Basic RAG Chat (No LLM) ###")
+     tester.test_advanced_rag_chat(
+         message="Festival Hà Nội diễn ra khi nào?",
+         use_advanced_rag=False
+     )
+
+     # Test 4: Advanced RAG chat
+     print("\n\n### TEST 4: Advanced RAG Chat (No LLM) ###")
+     tester.test_advanced_rag_chat(
+         message="Festival Hà Nội diễn ra khi nào và có những nghệ sĩ nào?",
+         use_advanced_rag=True,
+         use_reranking=True,
+         use_compression=True
+     )
+
+     # Test 5: Compare basic vs advanced
+     print("\n\n### TEST 5: Comparison Test ###")
+     tester.compare_basic_vs_advanced_rag(
+         message="Dao có được mang vào sự kiện không?"
+     )
+
+     print("\n\n" + "="*60)
+     print("ALL TESTS COMPLETED")
+     print("="*60)
+     print("\nNOTE: To test with actual LLM responses, add your Hugging Face token:")
+     print("  tester.test_advanced_rag_chat(message='...', hf_token='hf_xxxxx')")
+
+
+ if __name__ == "__main__":
+     main()
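The test client above passes a Python list as the `texts` form value; `requests` urlencodes a list as repeated form fields, which is the wire format a FastAPI parameter declared as `texts: List[str] = Form(...)` expects. This can be checked offline by preparing the request without sending it (the URL is a placeholder):

```python
import requests

# Prepare (but do not send) the same kind of request the tester builds
req = requests.Request(
    'POST', 'http://localhost:8000/index',  # placeholder, nothing is sent
    data={'id': 'doc1', 'texts': ['Text 1', 'Text 2']},
).prepare()

# The urlencoded body repeats the `texts` key once per list element
print(req.body)  # id=doc1&texts=Text+1&texts=Text+2
```

The same applies to the `('images', file)` tuples: repeating the field name is how multiple files land in a `List[UploadFile]` parameter on the server side.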
verify_dependencies.py ADDED
@@ -0,0 +1,102 @@
+ """
+ Verify all dependencies are installed correctly
+ Run: python verify_dependencies.py
+ """
+
+ import sys
+
+
+ def check_dependency(module_name, package_name=None):
+     """Check if a dependency is installed"""
+     if package_name is None:
+         package_name = module_name
+
+     try:
+         __import__(module_name)
+         print(f"✓ {package_name}")
+         return True
+     except ImportError as e:
+         print(f"✗ {package_name} - NOT INSTALLED")
+         print(f"  Error: {e}")
+         return False
+
+
+ def main():
+     print("="*60)
+     print("Dependency Verification")
+     print("="*60)
+
+     dependencies = [
+         # Web framework
+         ("fastapi", "fastapi"),
+         ("uvicorn", "uvicorn"),
+         ("multipart", "python-multipart"),
+
+         # ML & Embeddings
+         ("torch", "torch"),
+         ("transformers", "transformers"),
+         ("PIL", "pillow"),
+         ("numpy", "numpy"),
+
+         # Vector DB
+         ("qdrant_client", "qdrant-client"),
+
+         # Utilities
+         ("pydantic", "pydantic"),
+         ("dotenv", "python-dotenv"),
+
+         # MongoDB
+         ("pymongo", "pymongo"),
+         ("huggingface_hub", "huggingface-hub"),
+         ("timm", "timm"),
+         ("einops", "einops"),
+
+         # PDF Processing (NEW)
+         ("pypdfium2", "pypdfium2"),
+     ]
+
+     print("\nChecking dependencies...\n")
+
+     all_ok = True
+     for module, package in dependencies:
+         if not check_dependency(module, package):
+             all_ok = False
+
+     print("\n" + "="*60)
+     if all_ok:
+         print("✓ All dependencies installed successfully!")
+         print("\nYou can now run:")
+         print("  python main.py")
+     else:
+         print("✗ Some dependencies are missing!")
+         print("\nPlease install missing dependencies:")
+         print("  pip install -r requirements.txt")
+         sys.exit(1)
+
+     print("="*60)
+
+     # Check our custom application modules
+     print("\nChecking system modules...\n")
+
+     custom_modules = [
+         "embedding_service",
+         "qdrant_service",
+         "advanced_rag",
+         "pdf_parser",
+         "multimodal_pdf_parser",
+     ]
+
+     for module in custom_modules:
+         try:
+             __import__(module)
+             print(f"✓ {module}.py")
+         except ImportError as e:
+             print(f"✗ {module}.py - ERROR: {e}")
+
+     print("\n" + "="*60)
+     print("Verification complete!")
+     print("="*60)
+
+
+ if __name__ == "__main__":
+     main()
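The verifier above imports each package with `__import__`, which executes the package's import-time code; for heavy packages like torch that can take seconds. A lighter-weight alternative is to probe for the module spec only (the `is_installed` helper is illustrative, not part of the script):

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    # find_spec locates the module on sys.path without executing it;
    # it returns None for a missing top-level module.
    return importlib.util.find_spec(module_name) is not None
```

The trade-off: `find_spec` confirms the module can be found, but unlike a real import it will not surface errors that only occur when the module actually loads (e.g. a broken native extension), so the `__import__`-based check remains the stricter test.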