Haiss123 committed 6b98b09 · verified · 1 parent: 82189df

Upload 20 files
ADVANCED_RAG_GUIDE.md ADDED
# Advanced RAG Chatbot - User Guide

## What's New?

### 1. Multiple Images & Texts Support in `/index` API

The `/index` endpoint now supports indexing multiple texts and images in a single request (max 10 of each).

**Before:**
```python
# Old: only 1 text and 1 image per request
data = {
    'id': 'doc1',
    'text': 'Single text',
}
files = {'image': open('image.jpg', 'rb')}
```

**After:**
```python
import requests

# New: multiple texts and images (max 10 each)
data = {
    'id': 'doc1',
    'texts': ['Text 1', 'Text 2', 'Text 3'],  # up to 10
}
# Repeated 'images' entries are sent as a multipart list
files = [
    ('images', open('image1.jpg', 'rb')),
    ('images', open('image2.jpg', 'rb')),
    ('images', open('image3.jpg', 'rb')),  # up to 10
]
response = requests.post('http://localhost:8000/index', data=data, files=files)
```

**Example with cURL:**
```bash
curl -X POST "http://localhost:8000/index" \
  -F "id=event123" \
  -F "texts=Sự kiện âm nhạc tại Hà Nội" \
  -F "texts=Diễn ra vào ngày 20/10/2025" \
  -F "texts=Địa điểm: Trung tâm Hội nghị Quốc gia"
```

### 2. Advanced RAG Pipeline in `/chat` API

The chat endpoint now uses modern RAG techniques for better response quality:

#### Key Improvements:

1. **Query Expansion**: Automatically expands your question with variations
2. **Multi-Query Retrieval**: Searches with multiple query variants
3. **Reranking**: Re-scores results for better relevance
4. **Contextual Compression**: Keeps only the most relevant parts
5. **Better Prompt Engineering**: Prompts optimized for the LLM

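End to end, the stages above can be sketched in a few lines. This is only an illustration of the control flow with hypothetical helpers (`expand_query`, `vector_search`), not the service's actual implementation:

```python
# Minimal sketch of the advanced RAG stages; every helper here is
# hypothetical -- the real pipeline lives behind the /chat endpoint.

def expand_query(query):
    # Query expansion: produce simple variants of the question
    return [query, f"{query} (more detail)", f"Explain: {query}"]

def vector_search(query, top_k):
    # Stand-in for a vector DB call; returns (doc_id, score) pairs
    corpus = {"doc1": 0.9, "doc2": 0.6, "doc3": 0.4}
    return sorted(corpus.items(), key=lambda kv: -kv[1])[:top_k]

def advanced_rag(query, top_k=5, score_threshold=0.5):
    # 1. Query expansion -> multiple variants
    variants = expand_query(query)
    # 2. Multi-query retrieval: search once per variant, merge by best score
    merged = {}
    for v in variants:
        for doc_id, score in vector_search(v, top_k):
            merged[doc_id] = max(merged.get(doc_id, 0.0), score)
    # 3. Rerank (here: re-sort) and filter by score_threshold
    ranked = sorted(merged.items(), key=lambda kv: -kv[1])
    return [(d, s) for d, s in ranked if s >= score_threshold]

print(advanced_rag("Dao có nguy hiểm không?"))
# doc3 (score 0.4) falls below the 0.5 threshold
```

Lowering `score_threshold` admits the lower-scored document, which mirrors how the real `score_threshold` parameter behaves below.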
#### How to Use:

**Basic Usage (Auto-enabled):**
```python
import requests

response = requests.post('http://localhost:8000/chat', json={
    'message': 'Dao có nguy hiểm không?',
    'use_rag': True,
    'use_advanced_rag': True,  # default: True
    'hf_token': 'hf_xxxxx'
})

result = response.json()
print("Response:", result['response'])
print("RAG Stats:", result['rag_stats'])  # pipeline statistics
```

**Advanced Configuration:**
```python
response = requests.post('http://localhost:8000/chat', json={
    'message': 'Làm sao để tạo event mới?',
    'use_rag': True,
    'use_advanced_rag': True,

    # RAG pipeline options
    'use_query_expansion': True,  # expand query with variations
    'use_reranking': True,        # rerank results
    'use_compression': True,      # compress context
    'score_threshold': 0.5,       # min relevance score (0-1)
    'top_k': 5,                   # number of documents to retrieve

    # LLM options
    'max_tokens': 512,
    'temperature': 0.7,
    'hf_token': 'hf_xxxxx'
})
```

**Disable Advanced RAG (Use Basic):**
```python
response = requests.post('http://localhost:8000/chat', json={
    'message': 'Your question',
    'use_rag': True,
    'use_advanced_rag': False,  # fall back to basic RAG
})
```

## API Changes Summary

### `/index` Endpoint

**Old Parameters:**
- `id`: str (required)
- `text`: str (required)
- `image`: UploadFile (optional)

**New Parameters:**
- `id`: str (required)
- `texts`: List[str] (optional, max 10)
- `images`: List[UploadFile] (optional, max 10)

**Response:**
```json
{
  "success": true,
  "id": "doc123",
  "message": "Đã index thành công document doc123 với 3 texts và 2 images"
}
```

### `/chat` Endpoint

**New Parameters:**
- `use_advanced_rag`: bool (default: True) - Enable the advanced RAG pipeline
- `use_query_expansion`: bool (default: True) - Expand the query
- `use_reranking`: bool (default: True) - Rerank results
- `use_compression`: bool (default: True) - Compress context
- `score_threshold`: float (default: 0.5) - Minimum relevance score

**Response (New):**
```json
{
  "response": "AI generated answer...",
  "context_used": [...],
  "timestamp": "2025-10-29T...",
  "rag_stats": {
    "original_query": "Your question",
    "expanded_queries": ["Query variant 1", "Query variant 2"],
    "initial_results": 10,
    "after_rerank": 5,
    "after_compression": 5
  }
}
```

## Complete Examples

### Example 1: Index Multiple Social Media Posts

```python
import requests

# Index a social media event with multiple posts and images
data = {
    'id': 'event_festival_2025',
    'texts': [
        'Festival âm nhạc quốc tế Hà Nội 2025',
        'Ngày 15-17 tháng 11 năm 2025',
        'Địa điểm: Công viên Thống Nhất',
        'Line-up: Sơn Tùng MTP, Đen Vâu, Hoàng Thùy Linh',
        'Giá vé từ 500.000đ - 2.000.000đ'
    ]
}

files = [
    ('images', open('poster_festival.jpg', 'rb')),
    ('images', open('lineup.jpg', 'rb')),
    ('images', open('venue_map.jpg', 'rb'))
]

response = requests.post('http://localhost:8000/index', data=data, files=files)
print(response.json())
```

### Example 2: Advanced RAG Chat

```python
import requests

# Chat with the advanced RAG pipeline
chat_response = requests.post('http://localhost:8000/chat', json={
    'message': 'Festival âm nhạc Hà Nội diễn ra khi nào và ở đâu?',
    'use_rag': True,
    'use_advanced_rag': True,
    'top_k': 3,
    'score_threshold': 0.6,
    'hf_token': 'your_hf_token_here'
})

result = chat_response.json()
print("Answer:", result['response'])
print("\nRetrieved Context:")
for ctx in result['context_used']:
    print(f"- [{ctx['id']}] Confidence: {ctx['confidence']:.2%}")

print("\nRAG Pipeline Stats:")
print(f"- Original query: {result['rag_stats']['original_query']}")
print(f"- Query variants: {result['rag_stats']['expanded_queries']}")
print(f"- Documents retrieved: {result['rag_stats']['initial_results']}")
print(f"- After reranking: {result['rag_stats']['after_rerank']}")
```

## Performance Comparison

| Feature | Basic RAG | Advanced RAG |
|---------|-----------|--------------|
| Query Understanding | Single query | Multiple query variants |
| Retrieval Method | Direct vector search | Multi-query + hybrid |
| Result Ranking | Score from DB | Reranked with semantic similarity |
| Context Quality | Full text | Compressed, relevant parts only |
| Response Accuracy | Good | Better |
| Response Time | Faster | Slightly slower, higher quality |

## When to Use What?

**Use Basic RAG when:**
- You need fast response times
- Queries are straightforward
- The context is already well-structured

**Use Advanced RAG when:**
- You need higher accuracy
- Queries are complex or ambiguous
- Context documents are long
- You want better relevance

## Troubleshooting

### Error: "Tối đa 10 texts" (max 10 texts)
You're sending more than 10 texts. Reduce the request to at most 10.

### Error: "Tối đa 10 images" (max 10 images)
You're sending more than 10 images. Reduce the request to at most 10.

### RAG stats show 0 results
Your `score_threshold` may be too high. Try lowering it (e.g., 0.3-0.5).

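If the stats keep coming back empty, one pragmatic pattern is to retry with a progressively lower threshold. A small hedged sketch (the request parameters follow the `/chat` API described above; the fallback strategy itself is not part of the service):

```python
# Retry /chat with progressively lower score_threshold values until
# the RAG stats report at least one retrieved document.
import requests

def chat_with_fallback(message, thresholds=(0.6, 0.45, 0.3),
                       url='http://localhost:8000/chat'):
    for t in thresholds:
        r = requests.post(url, json={
            'message': message,
            'use_rag': True,
            'use_advanced_rag': True,
            'score_threshold': t,
        }).json()
        if r.get('rag_stats', {}).get('initial_results', 0) > 0:
            return r  # got context at this threshold
    return r  # last attempt, possibly without any context
```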
## Next Steps

To further improve RAG, consider:

1. **Add BM25 Hybrid Search**: Combine dense + sparse retrieval
2. **Use Cross-Encoder for Reranking**: Better than embedding similarity
3. **Implement Query Decomposition**: Break complex queries into sub-queries
4. **Add Citation/Source Tracking**: Show which document each fact comes from
5. **Integrate RAG-Anything**: For advanced multimodal document processing

For RAG-Anything integration (more complex), see: https://github.com/HKUDS/RAG-Anything
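On item 2: reranking is just "re-score each (query, chunk) pair with a stronger model and re-sort". The sketch below uses token overlap as a stand-in scorer so it stays self-contained; in practice you would swap `score_pair` for a cross-encoder model (e.g., from the `sentence-transformers` library):

```python
# Sketch of the rerank step. `score_pair` is a placeholder: real rerankers
# score the (query, chunk) pair jointly with a cross-encoder model.

def score_pair(query, chunk):
    # Token-overlap score in [0, 1] -- only a stand-in for a model score
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query, chunks, top_k=5):
    # Sort candidates by pair score, keep the best top_k
    ranked = sorted(chunks, key=lambda c: -score_pair(query, c))
    return ranked[:top_k]

docs = ["music festival in Hanoi", "ticket prices", "festival dates in Hanoi"]
print(rerank("festival in Hanoi", docs, top_k=2))
```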
MULTIMODAL_PDF_GUIDE.md ADDED
# Multimodal PDF Guide - PDFs with Text + Images

## Overview

The system now supports **multimodal PDFs** - PDFs that contain:
- ✅ Instructional text
- ✅ Image URLs (links to images)
- ✅ Markdown images: `![alt](url)`
- ✅ HTML images: `<img src="url">`

**Perfect for**: user guides with screenshots, tutorials with diagrams, documentation with visual aids.

---

## Why Multimodal?

### The Problem with Plain PDFs

Instructional PDFs typically look like this:
```
Step 1: Open the homepage
[See image: https://example.com/homepage.png]

Step 2: Click "Create"
![Create button](https://example.com/create-button.png)

Step 3: Fill in the details
<img src="https://example.com/form.png" alt="Form" />
```

The **old PDF parser** extracted only the text, so **all image URLs were lost** and the chatbot had no idea which images were relevant.

The **new multimodal PDF parser**:
- ✓ Extracts the text
- ✓ Detects every image URL
- ✓ Links images to their corresponding text chunks
- ✓ Stores the URLs in chunk metadata
- ✓ Returns images alongside text at chat time

---

## Comparison: Plain PDF vs Multimodal PDF

| Feature | Plain PDF (`/upload-pdf`) | Multimodal PDF (`/upload-pdf-multimodal`) |
|---------|---------------------------|-------------------------------------------|
| Extract text | ✓ | ✓ |
| Detect image URLs | ✗ | ✓ |
| Link images to chunks | ✗ | ✓ |
| Return images in chat | ✗ | ✓ |
| URL formats supported | ✗ | http://, https://, Markdown, HTML |
| Use case | Simple text documents | User guides, tutorials, docs with images |

---

## How to Use

### 1. Upload a Multimodal PDF

**Endpoint:** `POST /upload-pdf-multimodal`

**cURL:**
```bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
  -F "file=@user_guide_with_images.pdf" \
  -F "title=Hướng dẫn sử dụng hệ thống" \
  -F "description=User guide with screenshots" \
  -F "category=user_guide"
```

**Python:**
```python
import requests

with open('user_guide_with_images.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/upload-pdf-multimodal',
        files={'file': f},
        data={
            'title': 'User Guide with Screenshots',
            'category': 'user_guide'
        }
    )

result = response.json()
print(f"Indexed: {result['chunks_indexed']} chunks")
print(f"Message: {result['message']}")
```

**Response:**
```json
{
  "success": true,
  "document_id": "pdf_multimodal_20251029_150000",
  "filename": "user_guide_with_images.pdf",
  "chunks_indexed": 25,
  "message": "PDF 'user_guide_with_images.pdf' indexed successfully with 25 chunks and 15 images"
}
```

### 2. Chat with Multimodal Context

```python
import requests

response = requests.post('http://localhost:8000/chat', json={
    'message': 'Làm sao để tạo event mới?',
    'use_rag': True,
    'use_advanced_rag': True,
    'top_k': 3,
    'hf_token': 'your_token'
})

result = response.json()

# Response text
print("Answer:", result['response'])

# Retrieved context, including any linked images
for ctx in result['context_used']:
    print(f"\n--- Source: Page {ctx['metadata']['page']} ---")
    print(f"Text: {ctx['metadata']['text'][:200]}...")

    # Check whether this chunk has images
    if ctx['metadata'].get('has_images'):
        print(f"Images ({ctx['metadata']['num_images']}):")
        for img_url in ctx['metadata'].get('image_urls', []):
            print(f"  - {img_url}")
```

**Example Output:**
```
Answer: Để tạo event mới, bạn thực hiện các bước sau:
1. Mở trang chủ và click vào nút "Tạo Event" (xem hình minh họa)
2. Điền thông tin event...

--- Source: Page 5 ---
Text: Bước 1: Mở trang chủ và click vào nút "Tạo Event"...
Images (2):
  - https://example.com/homepage.png
  - https://example.com/create-button.png
```

---

## Preparing Your PDF

### Supported Formats

The multimodal parser detects the following formats:

1. **Standard URLs:**
   ```
   See image: https://example.com/image.png
   Screenshot: http://cdn.example.com/screenshot.jpg
   ```

2. **Markdown images:**
   ```markdown
   ![Homepage](https://example.com/homepage.png)
   ![Button](https://example.com/button.png)
   ```

3. **HTML images:**
   ```html
   <img src="https://example.com/form.png" alt="Form" />
   <img src="http://example.com/result.jpg">
   ```

4. **Bare URLs with image extensions:**
   ```
   https://example.com/pic.jpg
   https://example.com/chart.png
   https://example.com/diagram.svg
   ```

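Detection of these formats boils down to a few regular expressions. A minimal, self-contained approximation (not the actual patterns used by `MultimodalPDFParser`) could look like:

```python
# Rough approximation of multimodal image-URL detection; the real
# MultimodalPDFParser patterns may differ.
import re

IMG_EXT = r"(?:png|jpe?g|gif|svg|webp)"

def extract_image_urls(text):
    urls = []
    # Markdown: ![alt](url)
    urls += re.findall(r"!\[[^\]]*\]\((https?://[^)\s]+)\)", text)
    # HTML: <img src="url">
    urls += re.findall(r"<img[^>]+src=[\"'](https?://[^\"']+)[\"']", text)
    # Bare URLs ending in an image extension
    urls += re.findall(rf"(https?://\S+\.{IMG_EXT})", text)
    # De-duplicate while preserving order
    return list(dict.fromkeys(urls))

sample = """
See image: https://example.com/image.png
![Button](https://example.com/button.jpg)
<img src="https://example.com/form.png" alt="Form" />
"""
print(extract_image_urls(sample))
```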
### Best Practices

#### ✓ Good

**Example PDF content:**
```
# Creating an Event

## Step 1: Open the Homepage

Go to the system homepage.

![Homepage Screenshot](https://docs.example.com/images/homepage.png)

You will see the main screen with a menu on the left.

## Step 2: Click "Create Event"

Find and click the "Create Event" button in the top-right corner.

![Create Event Button](https://docs.example.com/images/create-button.png)

## Step 3: Fill in the Details

Enter the following into the form:
- Event name
- Date and time
- Location

Form template: https://docs.example.com/images/event-form.png
```

**Why this works:**
- Clear structure (headings)
- Each step pairs text with an image
- URLs are explicit and easy to detect
- Context sits right next to each image

#### ✗ Avoid

```
See the figures below [1] [2] [3]

[Images collected at the end of the document]

...

[1] homepage.png
[2] button.png
[3] form.png
```

**Why this fails:**
- Image references have no URLs
- Images are separated from their context
- Only filenames, no full URLs

---

## A Worked Example

### Building a Multimodal Guide PDF

**File: `chatbot_guide_with_images.md`**

```markdown
# ChatbotRAG User Guide

## 1. Upload a PDF

### Step 1: Prepare the PDF file

Make sure your PDF file is ready.

![PDF File Icon](https://via.placeholder.com/150?text=PDF+File)

### Step 2: Use cURL or Python

**With cURL:**

\`\`\`bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \\
  -F "file=@your_file.pdf"
\`\`\`

![cURL Command Example](https://via.placeholder.com/400x100?text=cURL+Command)

**With Python:**

\`\`\`python
import requests
# Upload code here
\`\`\`

### Step 3: Verify the Upload

Check the upload result:

https://via.placeholder.com/500x300?text=Upload+Success+Message

## 2. Chat with the Chatbot

After uploading, you can ask the chatbot:

![Chat Interface](https://via.placeholder.com/600x400?text=Chat+Interface)

**Sample questions:**
- "Làm sao để upload PDF?"
- "Các bước tạo event là gì?"

![Chat Example](https://via.placeholder.com/600x300?text=Chat+Example)

## 3. View the Results

The chatbot answers based on the PDF content:

https://via.placeholder.com/600x350?text=Chat+Response+with+Images
```

**Convert to PDF:**
```bash
pandoc chatbot_guide_with_images.md -o chatbot_guide_with_images.pdf
```

**Upload:**
```bash
curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
  -F "file=@chatbot_guide_with_images.pdf" \
  -F "title=ChatbotRAG Guide" \
  -F "category=user_guide"
```

---

## Advanced: Custom Image Handling

### Option 1: Local Images

If your images are local files, host them first:

```bash
# Simple HTTP server
cd /path/to/images
python -m http.server 8080

# Images are then available at:
# http://localhost:8080/image1.png
# http://localhost:8080/image2.png
```

Then reference them in the PDF:
```
![Image](http://localhost:8080/image1.png)
```

### Option 2: Cloud Storage

Upload the images to cloud storage (AWS S3, Cloudinary, Imgur, etc.):

```python
# Example: upload an image to Imgur and get back a public URL
import requests

def upload_to_imgur(image_path):
    client_id = 'YOUR_CLIENT_ID'
    headers = {'Authorization': f'Client-ID {client_id}'}

    with open(image_path, 'rb') as img:
        response = requests.post(
            'https://api.imgur.com/3/image',
            headers=headers,
            files={'image': img}
        )

    return response.json()['data']['link']

# Upload images
url1 = upload_to_imgur('screenshot1.png')
url2 = upload_to_imgur('screenshot2.png')

# Use the URLs in your PDF
print(f"![Screenshot 1]({url1})")
```

### Option 3: Embed Images as Base64

If the PDF has embedded images, render the pages out and encode them:

```python
import pypdfium2 as pdfium
import io
import base64

def extract_images_from_pdf(pdf_path):
    """Render each PDF page to a PNG and return it as a base64 data URL.

    Note: this rasterizes whole pages rather than pulling out individual
    embedded image objects.
    """
    pdf = pdfium.PdfDocument(pdf_path)
    images = []

    for page_num in range(len(pdf)):
        page = pdf[page_num]
        # Render the page as an image
        bitmap = page.render(scale=2.0)
        pil_image = bitmap.to_pil()

        # Encode as a base64 PNG
        buffered = io.BytesIO()
        pil_image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()

        images.append({
            'page': page_num + 1,
            'base64': img_str,
            'url': f'data:image/png;base64,{img_str}'
        })

    return images
```

---

## Troubleshooting

### Images are not detected

**Likely causes:**
- URLs are malformed (missing http://)
- URLs are broken across line wraps
- Invalid Markdown syntax

**How to check:**
```python
# Test URL detection directly
from multimodal_pdf_parser import MultimodalPDFParser

parser = MultimodalPDFParser()
test_text = """
See image: https://example.com/image.png
![Alt](https://example.com/pic.jpg)
"""

urls = parser.extract_image_urls(test_text)
print("Found URLs:", urls)
```

### The chatbot does not return images

**Check:**
1. Verify the PDF was indexed with the multimodal parser:
   ```bash
   curl http://localhost:8000/documents/pdf
   # Look for "type": "multimodal_pdf"
   ```

2. Check that the metadata contains `image_urls`:
   ```python
   response = requests.post('http://localhost:8000/chat', ...)
   for ctx in response.json()['context_used']:
       print(ctx['metadata'].get('image_urls', []))
   ```

### Too many images → oversized chunks

**Solution:** reduce the chunk size so fewer images land in each chunk:

```python
# In multimodal_pdf_parser.py
parser = MultimodalPDFParser(
    chunk_size=300,    # smaller chunks
    chunk_overlap=30,
    extract_images=True
)
```

---

## Conclusion

### When Should You Use Multimodal PDF?

✓ **Use `/upload-pdf-multimodal` when:**
- The PDF contains illustrations (screenshots, diagrams)
- You want the chatbot to reference images in its answers
- You have user guides or tutorials with visual instructions
- The documentation includes charts or tables rendered as images

✓ **Use plain `/upload-pdf` when:**
- The PDF is text-only
- You don't need images in the context
- You have simple documents or FAQs

### The Complete Workflow

1. **Create the PDF** with text + image URLs (Markdown/HTML)
2. **Upload** it via `/upload-pdf-multimodal`
3. **Verify** that the images were detected
4. **Chat** - images are automatically included in the context
5. **Display** the images in your UI

---

## Example: Full Workflow

```python
"""
Complete workflow: upload a multimodal PDF, then chat against it.
"""
import requests

# 1. Upload the multimodal PDF
print("=== Uploading Multimodal PDF ===")
with open('user_guide_with_images.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/upload-pdf-multimodal',
        files={'file': f},
        data={'title': 'User Guide', 'category': 'guide'}
    )

result = response.json()
print(f"✓ Indexed: {result['chunks_indexed']} chunks")
print(f"✓ Message: {result['message']}")

# 2. Chat with multimodal context
print("\n=== Chatting ===")
response = requests.post('http://localhost:8000/chat', json={
    'message': 'Làm sao để tạo event mới? Cho tôi xem hình minh họa.',
    'use_rag': True,
    'use_advanced_rag': True,
    'top_k': 3,
    'hf_token': 'your_token'
})

chat_result = response.json()
print(f"Answer: {chat_result['response']}\n")

# 3. Display the context together with its images
print("=== Context with Images ===")
for i, ctx in enumerate(chat_result['context_used'], 1):
    print(f"\n[{i}] Page {ctx['metadata']['page']}, Confidence: {ctx['confidence']:.2%}")
    print(f"Text: {ctx['metadata']['text'][:150]}...")

    if ctx['metadata'].get('has_images'):
        print(f"Images ({ctx['metadata']['num_images']}):")
        for url in ctx['metadata']['image_urls']:
            print(f"  🖼️ {url}")
```

---

**PDFs with illustrations now work end to end! 🎨📄**
PDF_RAG_GUIDE.md ADDED
# Using PDFs with ChatbotRAG

## Overview

ChatbotRAG now supports **uploading and indexing PDFs** so the chatbot can answer questions from their content. This is useful for:
- Product user guides
- FAQ documents
- Policies and regulations
- Technical documentation

## How It Works

1. **Upload PDF** → the system parses the PDF into text
2. **Chunking** → the text is split into chunks (default: 500 words/chunk, 50-word overlap)
3. **Embedding** → each chunk is converted into a vector embedding
4. **Indexing** → chunks are stored in Qdrant + MongoDB
5. **Chat** → the chatbot retrieves the relevant chunks and answers the question

## Method 1: Upload a PDF via the API

### Endpoint: `POST /upload-pdf`

**Request:**
```bash
curl -X POST "http://localhost:8000/upload-pdf" \
  -F "file=@huong_dan_su_dung.pdf" \
  -F "title=Hướng dẫn sử dụng ChatbotRAG" \
  -F "description=Tài liệu hướng dẫn đầy đủ về ChatbotRAG" \
  -F "category=user_guide"
```

**Python:**
```python
import requests

with open('huong_dan_su_dung.pdf', 'rb') as f:
    files = {'file': f}
    data = {
        'title': 'Hướng dẫn sử dụng ChatbotRAG',
        'description': 'Tài liệu hướng dẫn đầy đủ',
        'category': 'user_guide'
    }

    response = requests.post(
        'http://localhost:8000/upload-pdf',
        files=files,
        data=data
    )

print(response.json())
```

**Response:**
```json
{
  "success": true,
  "document_id": "pdf_20251029_143022",
  "filename": "huong_dan_su_dung.pdf",
  "chunks_indexed": 45,
  "message": "PDF 'huong_dan_su_dung.pdf' đã được index thành công với 45 chunks"
}
```

### Parameters:
- `file` (required): the PDF file
- `document_id` (optional): custom ID; auto-generated by default
- `title` (optional): document title
- `description` (optional): description
- `category` (optional): category (user_guide, faq, policy, etc.)

## Method 2: Batch Index Multiple PDFs

If you have many PDF files, use the batch script:

```bash
# Index every PDF in a directory
python batch_index_pdfs.py ./docs/user_guides

# With a custom category
python batch_index_pdfs.py ./docs/policies --category=policy

# Force reindex (overwrite existing entries)
python batch_index_pdfs.py ./docs/faq --category=faq --force
```

The script automatically:
- Scans the directory for .pdf files
- Indexes each file with appropriate metadata
- Skips files that are already indexed (unless --force is given)
- Prints progress and a summary

## Managing PDF Documents

### List Indexed PDFs

```bash
curl http://localhost:8000/documents/pdf
```

**Response:**
```json
{
  "documents": [
    {
      "document_id": "pdf_user_guide",
      "type": "pdf",
      "filename": "huong_dan_su_dung.pdf",
      "num_chunks": 45,
      "metadata": {
        "title": "Hướng dẫn sử dụng",
        "category": "user_guide"
      }
    }
  ],
  "total": 1
}
```

### Delete a PDF Document

```bash
# Delete the document and all of its chunks
curl -X DELETE http://localhost:8000/documents/pdf/pdf_user_guide
```

## Chatting over PDF Content

Once a PDF is indexed, you can chat as usual:

```python
import requests

response = requests.post('http://localhost:8000/chat', json={
    'message': 'Làm sao để upload PDF vào ChatbotRAG?',
    'use_rag': True,
    'use_advanced_rag': True,
    'top_k': 5,
    'hf_token': 'your_hf_token'
})

result = response.json()
print("Answer:", result['response'])

# Inspect the sources
for ctx in result['context_used']:
    print(f"- Page {ctx['metadata']['page']}: {ctx['metadata']['text'][:100]}...")
```

The chatbot automatically searches the indexed PDF content and answers from it.

## Writing a User Guide PDF

### Suggested Outline

A suggested structure for a ChatbotRAG user-guide PDF:

```
CHATBOTRAG USER GUIDE

1. INTRODUCTION
   - What is ChatbotRAG?
   - Key features
   - Use cases

2. QUICK START
   2.1. Installation
   2.2. Starting the server
   2.3. Accessing the API

3. INDEXING DATA
   3.1. Indexing plain text
   3.2. Indexing with images
   3.3. Indexing multiple texts and images at once
   3.4. Uploading PDFs

4. SEARCH
   4.1. Search by text
   4.2. Search by image
   4.3. Hybrid search

5. CHATTING WITH THE CHATBOT
   5.1. Basic chat
   5.2. Chat with RAG
   5.3. Advanced RAG options
   5.4. Tuning LLM parameters

6. MANAGING DOCUMENTS
   6.1. Listing documents
   6.2. Deleting documents
   6.3. Managing PDF files

7. FREQUENTLY ASKED QUESTIONS (FAQ)
   - How do I upload a PDF?
   - What if the chatbot can't find the information?
   - How can I improve accuracy?
   - What is the token limit?

8. API REFERENCE
   - POST /index
   - POST /search
   - POST /chat
   - POST /upload-pdf
   - GET /documents/pdf
```

206
+ ### Tạo PDF Từ Markdown
207
+
208
+ Bạn có thể tạo PDF từ Markdown bằng nhiều tools:
209
+
210
+ **1. Pandoc (Recommended):**
211
+ ```bash
212
+ pandoc guide.md -o guide.pdf --pdf-engine=xelatex
213
+ ```
214
+
215
+ **2. Online Tools:**
216
+ - https://www.markdowntopdf.com/
217
+ - https://md2pdf.netlify.app/
218
+
219
+ **3. VS Code Extension:**
220
+ - Install "Markdown PDF" extension
221
+ - Right-click file .md → "Markdown PDF: Export (pdf)"
222
+
223
+ ### Ví Dụ Markdown Content
224
+
225
+ Tạo file `chatbot_guide.md`:
226
+
227
+ ```markdown
228
+ # Hướng Dẫn Sử Dụng ChatbotRAG
229
+
230
+ ## 1. Upload PDF
231
+
232
+ Để upload PDF vào hệ thống:
233
+
234
+ ### Bước 1: Chuẩn bị file PDF
235
+ - File phải có định dạng .pdf
236
+ - Nội dung nên rõ ràng, có cấu trúc
237
+
238
+ ### Bước 2: Upload qua API
239
+
240
+ \`\`\`bash
241
+ curl -X POST "http://localhost:8000/upload-pdf" \
242
+ -F "file=@your_file.pdf" \
243
+ -F "title=Tên tài liệu"
244
+ \`\`\`
245
+
246
+ ### Bước 3: Kiểm tra
247
+ Sau khi upload, hệ thống sẽ trả về số chunks đã được index.
248
+
249
+ ## 2. Chat Với Chatbot
250
+
251
+ Sau khi upload PDF, bạn có thể hỏi chatbot:
252
+
253
+ **Ví dụ:**
254
+ - "Làm sao để upload PDF?"
255
+ - "Các bước tạo event là gì?"
256
+ - "Tính năng nào trong hệ thống?"
257
+
258
+ Chatbot sẽ tìm kiếm trong PDF và trả lời dựa trên nội dung đã index.
259
+
260
+ ## 3. FAQ
261
+
262
+ ### Câu hỏi 1: Upload PDF tối đa bao nhiêu trang?
263
+ Không giới hạn, nhưng PDF càng lớn thì thời gian index càng lâu.
264
+
265
+ ### Câu hỏi 2: Có thể upload nhiều PDFs không?
266
+ Có, bạn có thể upload nhiều PDFs. Mỗi PDF sẽ có document_id riêng.
267
+
268
+ ### Câu hỏi 3: Làm sao để xóa PDF đã upload?
269
+ Sử dụng endpoint DELETE /documents/pdf/{document_id}
270
+ ```
271
+
272
+ Then convert it to PDF:
273
+ ```bash
274
+ pandoc chatbot_guide.md -o chatbot_guide.pdf
275
+ ```
276
+
277
+ ## Best Practices
278
+
279
+ ### 1. PDF Structure
+ - ✓ Clear titles
+ - ✓ Split into sections/chapters
+ - ✓ Use bullet points
+ - ✓ Avoid overly complex images (they make text extraction hard)
+
+ ### 2. Content
+ - ✓ Write short, easy-to-understand sentences
+ - ✓ Keep each section focused on one topic
+ - ✓ Include concrete examples
+ - ✗ Avoid long prose that is hard to split into sentences
+
+ ### 3. Metadata
+ - Always set a clear `title`
+ - Use `category` for classification
+ - Add a `description` for easier management
295
+
296
+ ### 4. Chunking
+ Defaults:
+ - Chunk size: 500 words
+ - Overlap: 50 words
+
+ These can be customized in `pdf_parser.py`:
302
+ ```python
303
+ parser = PDFParser(
304
+ chunk_size=500, # Increase for longer context per chunk
+ chunk_overlap=50, # Increase to preserve more context across chunks
+ min_chunk_size=50 # Minimum words per chunk
307
+ )
308
+ ```
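As an illustration of the settings above, here is a minimal sketch of word-based chunking with overlap. `chunk_words` is a hypothetical helper written for this guide, not the actual `pdf_parser.py` implementation:

```python
# Minimal sketch of word-based chunking with overlap.
def chunk_words(text, chunk_size=500, chunk_overlap=50, min_chunk_size=50):
    words = text.split()
    chunks = []
    step = chunk_size - chunk_overlap  # each chunk starts this many words later
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if len(piece) >= min_chunk_size:  # drop trailing fragments that are too small
            chunks.append(" ".join(piece))
    return chunks

# A 1200-word text with the defaults yields chunks starting at words 0, 450, 900
chunks = chunk_words("word " * 1200)
print(len(chunks))  # 3
```

With these numbers, consecutive chunks share 50 words, which is what keeps context intact across chunk boundaries.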
309
+
310
+ ## Troubleshooting
311
+
312
+ ### Error: "Error reading PDF"
+ - Check whether the PDF file is corrupt
+ - Open it in a PDF reader to verify
+ - Re-export the PDF if needed
+
+ ### Error: "No text extracted"
+ - The PDF may consist of scanned images (no text layer)
+ - Run OCR before indexing (with a tool such as Tesseract)
+
+ ### The chatbot cannot find the information
+ - Check `score_threshold` - try lowering it (e.g., 0.3)
+ - Increase `top_k` to retrieve more documents
+ - Rephrase the question
+
+ ### Chunks are too short/long
+ - Adjust `chunk_size` in `pdf_parser.py`
+ - Reindex the PDF with the new settings
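The retrieval fixes only require changing the request payload. A minimal sketch, assuming the `/chat` request fields shown elsewhere in this guide:

```python
# Sketch: a retrieval-friendly /chat payload for when the chatbot misses information.
# Field names follow the examples in this guide; the values are suggestions.
def relaxed_chat_payload(question, hf_token):
    return {
        'message': question,
        'use_rag': True,
        'use_advanced_rag': True,
        'score_threshold': 0.3,  # lowered from the usual 0.5
        'top_k': 10,             # retrieve more candidate documents
        'hf_token': hf_token,
    }

payload = relaxed_chat_payload('Làm sao để tạo event mới?', 'your_token')
print(payload['score_threshold'])  # 0.3
```

Send it as usual with `requests.post('http://localhost:8000/chat', json=payload)`.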
329
+
330
+ ## Complete Example
331
+
332
+ ```python
333
+ # 1. Upload PDF
334
+ import requests
335
+
336
+ with open('user_guide.pdf', 'rb') as f:
337
+ response = requests.post(
338
+ 'http://localhost:8000/upload-pdf',
339
+ files={'file': f},
340
+ data={
341
+ 'title': 'Hướng dẫn sử dụng',
342
+ 'category': 'user_guide'
343
+ }
344
+ )
345
+
346
+ doc_id = response.json()['document_id']
347
+ print(f"Uploaded: {doc_id}")
348
+
349
+ # 2. List PDFs
350
+ response = requests.get('http://localhost:8000/documents/pdf')
351
+ print(response.json())
352
+
353
+ # 3. Chat
354
+ response = requests.post('http://localhost:8000/chat', json={
355
+ 'message': 'Làm sao để tạo event mới?',
356
+ 'use_rag': True,
357
+ 'use_advanced_rag': True,
358
+ 'hf_token': 'your_token'
359
+ })
360
+
361
+ print("Answer:", response.json()['response'])
362
+
363
+ # 4. Delete PDF (if needed)
364
+ response = requests.delete(f'http://localhost:8000/documents/pdf/{doc_id}')
365
+ print(response.json())
366
+ ```
367
+
368
+ ## Next Steps
369
+
370
+ 1. **Create your guide PDF** with content about your system
+ 2. **Upload the PDF** into the system
+ 3. **Test the chatbot** - ask questions about the PDF content
+ 4. **Fine-tune** - adjust parameters if needed
+ 5. **Add more PDFs** - FAQs, policies, etc.
375
+
376
+ ## Support
377
+
378
+ If something goes wrong, check:
+ - The server logs for errors
+ - MongoDB to confirm the documents were stored
+ - The Qdrant collection to verify the chunks were indexed
382
+
383
+ ## Conclusion
384
+
385
+ The PDF RAG system lets your chatbot answer questions from existing documents without retraining the model. You only need to:
+ 1. Upload a PDF
+ 2. Chat as usual
+ 3. The chatbot searches and answers based on the PDF content
+
+ Simple and effective!
QUICK_START_PDF.md ADDED
@@ -0,0 +1,310 @@
1
+ # Quick Start: PDF-Based ChatbotRAG
2
+
3
+ ## Quick Summary
+
+ You can now:
+ 1. **Upload a PDF** user guide into the system
+ 2. **Have the chatbot answer automatically**, based on the PDF content
+ 3. Skip model training entirely - just upload the PDF!
9
+
10
+ ---
11
+
12
+ ## Complete Workflow
+
+ ### Step 1: Create the Guide PDF
+
+ You have two options:
+
+ **Option 1: Use the Bundled Template**
+
+ The file `chatbot_guide_template.md` is ready to use. Customize the content for your system, then convert it to PDF:
21
+
22
+ ```bash
23
+ # Install pandoc (if you don't have it yet)
24
+ # Windows: choco install pandoc
25
+ # Mac: brew install pandoc
26
+ # Linux: sudo apt-get install pandoc
27
+
28
+ # Convert markdown to PDF
29
+ pandoc chatbot_guide_template.md -o chatbot_user_guide.pdf --pdf-engine=xelatex
30
+ ```
31
+
32
+ **Option 2: Write the Content Yourself**
+
+ Create a Word/Google Docs file with the guide content, then:
+ - File → Export → PDF
+
+ **The content should include:**
+ - A system introduction
+ - The main features
+ - Usage instructions for each feature
+ - FAQ (frequently asked questions)
+ - Examples
43
+
44
+ ### Step 2: Upload the PDF into the System
45
+
46
+ ```bash
47
+ # Start the server
48
+ cd ChatbotRAG
49
+ python main.py
50
+ ```
51
+
52
+ In another terminal:
53
+
54
+ ```bash
55
+ # Upload PDF
56
+ curl -X POST "http://localhost:8000/upload-pdf" \
57
+ -F "file=@chatbot_user_guide.pdf" \
58
+ -F "title=Hướng dẫn sử dụng ChatbotRAG" \
59
+ -F "description=Tài liệu hướng dẫn đầy đủ" \
60
+ -F "category=user_guide"
61
+ ```
62
+
63
+ Or with Python:
64
+
65
+ ```python
66
+ import requests
67
+
68
+ with open('chatbot_user_guide.pdf', 'rb') as f:
69
+ response = requests.post(
70
+ 'http://localhost:8000/upload-pdf',
71
+ files={'file': f},
72
+ data={
73
+ 'title': 'Hướng dẫn sử dụng ChatbotRAG',
74
+ 'category': 'user_guide'
75
+ }
76
+ )
77
+
78
+ print(response.json())
79
+ # Output: {"success": true, "document_id": "pdf_...", "chunks_indexed": 45}
80
+ ```
81
+
82
+ ### Step 3: Verify the Upload
83
+
84
+ ```bash
85
+ # List the uploaded PDFs
86
+ curl http://localhost:8000/documents/pdf
87
+ ```
88
+
89
+ ### Step 4: Chat!
90
+
91
+ ```python
92
+ import requests
93
+
94
+ response = requests.post('http://localhost:8000/chat', json={
95
+ 'message': 'Làm sao để upload PDF vào ChatbotRAG?',
96
+ 'use_rag': True,
97
+ 'use_advanced_rag': True,
98
+ 'top_k': 5,
99
+ 'hf_token': 'your_huggingface_token' # Get from https://huggingface.co/settings/tokens
100
+ })
101
+
102
+ result = response.json()
103
+ print("Answer:", result['response'])
104
+ print("\nSources:")
105
+ for ctx in result['context_used']:
106
+ print(f"- Page {ctx['metadata']['page']}: Confidence {ctx['confidence']:.2%}")
107
+ ```
108
+
109
+ ---
110
+
111
+ ## Sample Test Script
112
+
113
+ File `test_pdf_chatbot.py`:
114
+
115
+ ```python
116
+ """
117
+ Test PDF-based chatbot
118
+ """
119
+ import requests
120
+ import time
121
+
122
+ BASE_URL = "http://localhost:8000"
123
+ HF_TOKEN = "your_huggingface_token" # Replace with your token
124
+
125
+ def upload_pdf():
126
+ """Upload PDF guide"""
127
+ print("=== Uploading PDF ===")
128
+
129
+ with open('chatbot_user_guide.pdf', 'rb') as f:
130
+ response = requests.post(
131
+ f'{BASE_URL}/upload-pdf',
132
+ files={'file': f},
133
+ data={
134
+ 'title': 'ChatbotRAG User Guide',
135
+ 'category': 'user_guide'
136
+ }
137
+ )
138
+
139
+ result = response.json()
140
+ print(f"✓ Uploaded: {result['chunks_indexed']} chunks")
141
+ return result['document_id']
142
+
143
+ def chat(question):
144
+ """Ask chatbot"""
145
+ print(f"\n=== Question: {question} ===")
146
+
147
+ response = requests.post(f'{BASE_URL}/chat', json={
148
+ 'message': question,
149
+ 'use_rag': True,
150
+ 'use_advanced_rag': True,
151
+ 'top_k': 5,
152
+ 'hf_token': HF_TOKEN
153
+ })
154
+
155
+ result = response.json()
156
+ print(f"Answer: {result['response']}\n")
157
+
158
+ print(f"Retrieved {len(result['context_used'])} documents:")
159
+ for i, ctx in enumerate(result['context_used'], 1):
160
+ print(f"{i}. Page {ctx['metadata'].get('page')}, Confidence: {ctx['confidence']:.2%}")
161
+
162
+ def main():
163
+ # 1. Upload PDF
164
+ doc_id = upload_pdf()
165
+
166
+ # Wait for indexing to complete
167
+ time.sleep(2)
168
+
169
+ # 2. Test questions
170
+ questions = [
171
+ "Làm sao để upload PDF vào hệ thống?",
172
+ "Chatbot có support tiếng Việt không?",
173
+ "Tối đa bao nhiêu texts có thể index cùng lúc?",
174
+ "Advanced RAG có những tính năng gì?"
175
+ ]
176
+
177
+ for q in questions:
178
+ chat(q)
179
+ time.sleep(1)
180
+
181
+ if __name__ == "__main__":
182
+ main()
183
+ ```
184
+
185
+ Chạy:
186
+ ```bash
187
+ python test_pdf_chatbot.py
188
+ ```
189
+
190
+ ---
191
+
192
+ ## Uploading Multiple PDFs at Once
+
+ If you have several PDFs (FAQ, user guide, policies, etc.):
195
+
196
+ ```bash
197
+ # Put all the PDFs in one directory
+ mkdir docs
+ # Copy the PDFs into docs/
200
+
201
+ # Batch index
202
+ python batch_index_pdfs.py ./docs --category=user_guide
203
+ ```
204
+
205
+ The script automatically indexes every PDF and skips files that are already indexed.
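The skip logic can be sketched as follows. `pdfs_to_index` is a hypothetical helper written for this guide; the actual `batch_index_pdfs.py` may differ:

```python
from pathlib import Path

# Sketch: decide which PDFs in a folder still need indexing,
# given the set of document titles already in the system.
def pdfs_to_index(folder, already_indexed):
    return [p for p in sorted(Path(folder).glob("*.pdf"))
            if p.stem not in already_indexed]
```

Each remaining file would then be POSTed to `/upload-pdf` exactly as in the single-file example above.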
206
+
207
+ ---
208
+
209
+ ## Sample Test Questions
+
+ After uploading the guide PDF, test with questions like:
212
+
213
+ **About features:**
214
+ - "ChatbotRAG có những tính năng gì?"
215
+ - "Làm sao để index dữ liệu?"
216
+ - "Advanced RAG là gì?"
217
+
218
+ **Usage instructions:**
219
+ - "Làm sao để upload PDF?"
220
+ - "Cách chat với chatbot như thế nào?"
221
+ - "Làm sao để xem lịch sử chat?"
222
+
223
+ **FAQ:**
224
+ - "Chatbot không tìm thấy thông tin phải làm sao?"
225
+ - "Tối đa bao nhiêu images có thể upload?"
226
+ - "Token limit là bao nhiêu?"
227
+
228
+ **Technical:**
229
+ - "Score threshold là gì?"
230
+ - "Top_k trong chat request có ý nghĩa gì?"
231
+ - "Làm sao để cải thiện độ chính xác?"
232
+
233
+ ---
234
+
235
+ ## Tips for Better Chatbot Answers
+
+ ### 1. PDF Content Quality
+ - Write clearly and with structure
+ - Keep each section focused on one topic
+ - Include concrete examples
+ - Build the FAQ from real questions
242
+
243
+ ### 2. Chat Settings
244
+ ```python
245
+ {
246
+ 'use_advanced_rag': True, # Always enable
+ 'use_reranking': True, # Rerank for accuracy
+ 'use_compression': True, # Compress context
+ 'score_threshold': 0.5, # 0.4-0.6 works well
+ 'top_k': 5, # 3-7 depending on use case
+ 'temperature': 0.3 # Low for factual answers
252
+ }
253
+ ```
254
+
255
+ ### 3. Query Tips
+ - Ask clear, specific questions
+ - Avoid overly generic questions
+ - If nothing is found, rephrase the question
259
+
260
+ ---
261
+
262
+ ## Monitoring
263
+
264
+ ### Check Index Status
265
+ ```bash
266
+ curl http://localhost:8000/stats
267
+ ```
268
+
269
+ ### View PDFs
270
+ ```bash
271
+ curl http://localhost:8000/documents/pdf
272
+ ```
273
+
274
+ ### Check Chat History
275
+ ```bash
276
+ curl "http://localhost:8000/history?limit=10"
277
+ ```
278
+
279
+ ---
280
+
281
+ ## Conclusion
+
+ You can now:
+
+ ✓ Create a guide PDF with your own content
+ ✓ Upload the PDF into the system in seconds
+ ✓ Let the chatbot answer automatically from the PDF content
+ ✓ No training, no complicated code
+ ✓ Need to update the content? Just upload a new PDF!
+
+ **Next Steps:**
+ 1. Create your guide PDF (or customize the template)
+ 2. Upload it into the system
+ 3. Test with real questions
+ 4. Fine-tune the settings if needed
+ 5. Add more PDFs (FAQ, policies, etc.)
297
+
298
+ ---
299
+
300
+ ## Key Files
+
+ - `pdf_parser.py` - PDF parsing engine
+ - `batch_index_pdfs.py` - Batch indexing script
+ - `chatbot_guide_template.md` - Template PDF content
+ - `PDF_RAG_GUIDE.md` - Detailed PDF RAG guide
+ - `ADVANCED_RAG_GUIDE.md` - Advanced RAG features
307
+
308
+ ---
309
+
310
+ **Good luck! 🚀**
SUMMARY.md ADDED
@@ -0,0 +1,429 @@
1
+ # ChatbotRAG - Complete Summary
2
+
3
+ ## System Overview
+
+ The ChatbotRAG system has been comprehensively upgraded with advanced features:
6
+
7
+ ### ✨ Tính Năng Chính
8
+
9
+ 1. **Multiple Inputs Support** (/index)
+ - Index up to 10 texts + 10 images at once
+ - Embeddings are averaged automatically
+
+ 2. **Advanced RAG Pipeline** (/chat)
+ - Query Expansion
+ - Multi-Query Retrieval
+ - Reranking with semantic similarity
+ - Contextual Compression
+ - Better Prompt Engineering
+
+ 3. **PDF Support** (/upload-pdf)
+ - Parse PDFs into chunks
+ - Automatic chunking with overlap
+ - Index into the RAG system
+
+ 4. **Multimodal PDF** (/upload-pdf-multimodal) ⭐ NEW
+ - Extract text + image URLs from the PDF
+ - Link images to text chunks
+ - Return images alongside text in chat
+ - Perfect for user guides with screenshots
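The averaged embeddings mentioned in point 1 can be sketched as follows. This is assumed behavior for illustration; the actual `/index` implementation may differ:

```python
import numpy as np

# Sketch: combine several text/image embeddings into one document vector
# by averaging, then L2-normalizing the result.
def average_embeddings(vectors):
    mean = np.mean(np.stack(vectors), axis=0)
    return mean / (np.linalg.norm(mean) + 1e-8)

# Two orthogonal unit vectors average to a vector pointing between them
v = average_embeddings([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
print(v)
```

Normalizing after averaging keeps cosine-similarity search well behaved regardless of how many inputs were combined.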
30
+
31
+ ---
32
+
33
+ ## System Architecture
34
+
35
+ ```
36
+ ┌─────────────────────────────────────────────────────────────┐
37
+ │ FastAPI Application │
38
+ ├─────────────────────────────────────────────────────────────┤
39
+ │ │
40
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
41
+ │ │ Indexing │ │ Search │ │ Chat │ │
42
+ │ │ Endpoints │ │ Endpoints │ │ Endpoint │ │
43
+ │ └──────────────┘ └──────────────┘ └──────────────┘ │
44
+ │ │
45
+ ├─────────────────────────────────────────────────────────────┤
46
+ │ │
47
+ │ ┌──────────────────────────────────────────────────────┐ │
48
+ │ │ Advanced RAG Pipeline │ │
49
+ │ │ • Query Expansion │ │
50
+ │ │ • Multi-Query Retrieval │ │
51
+ │ │ • Reranking │ │
52
+ │ │ • Contextual Compression │ │
53
+ │ └──────────────────────────────────────────────────────┘ │
54
+ │ │
55
+ ├─────────────────────────────────────────────────────────────┤
56
+ │ │
57
+ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
58
+ │ │ Jina CLIP │ │ Qdrant │ │ MongoDB │ │
59
+ │ │ v2 │ │ Vector DB │ │ Documents │ │
60
+ │ └──────────────┘ └──────────────┘ └──────────────┘ │
61
+ │ │
62
+ │ ┌──────────────┐ ┌──────────────┐ │
63
+ │ │ PDF │ │ Multimodal │ │
64
+ │ │ Parser │ │ PDF Parser │ │
65
+ │ └──────────────┘ └──────────────┘ │
66
+ │ │
67
+ └─────────────────────────────────────────────────────────────┘
68
+ ```
69
+
70
+ ---
71
+
72
+ ## Key Files
73
+
74
+ ### Core System
75
+ - **main.py** - FastAPI application with all endpoints
+ - **embedding_service.py** - Jina CLIP v2 embeddings
+ - **qdrant_service.py** - Qdrant vector DB operations
+ - **advanced_rag.py** - Advanced RAG pipeline
79
+
80
+ ### PDF Processing
81
+ - **pdf_parser.py** - Basic PDF parser (text only)
82
+ - **multimodal_pdf_parser.py** - Multimodal PDF parser (text + images)
83
+ - **batch_index_pdfs.py** - Batch indexing script
84
+
85
+ ### Documentation
86
+ - **ADVANCED_RAG_GUIDE.md** - Advanced RAG features guide
87
+ - **PDF_RAG_GUIDE.md** - PDF usage guide
88
+ - **MULTIMODAL_PDF_GUIDE.md** - Multimodal PDF guide ⭐
89
+ - **QUICK_START_PDF.md** - Quick start for PDF
90
+ - **chatbot_guide_template.md** - Template for user guide PDF
91
+
92
+ ### Testing
93
+ - **test_advanced_features.py** - Test advanced features
94
+ - **test_pdf_chatbot.py** - Test PDF chatbot (example in docs)
95
+
96
+ ---
97
+
98
+ ## API Endpoints
99
+
100
+ ### 1. Indexing
101
+
102
+ | Endpoint | Method | Description |
103
+ |----------|--------|-------------|
104
+ | `/index` | POST | Index texts + images (max 10 each) |
105
+ | `/documents` | POST | Add text document |
106
+ | `/upload-pdf` | POST | Upload PDF (text only) |
107
+ | `/upload-pdf-multimodal` | POST | Upload PDF with images ⭐ |
108
+
109
+ ### 2. Search
110
+
111
+ | Endpoint | Method | Description |
112
+ |----------|--------|-------------|
113
+ | `/search` | POST | Hybrid search (text + image) |
114
+ | `/search/text` | POST | Text-only search |
115
+ | `/search/image` | POST | Image-only search |
116
+ | `/rag/search` | POST | RAG knowledge base search |
117
+
118
+ ### 3. Chat
119
+
120
+ | Endpoint | Method | Description |
121
+ |----------|--------|-------------|
122
+ | `/chat` | POST | Chat with Advanced RAG |
123
+
124
+ ### 4. Management
125
+
126
+ | Endpoint | Method | Description |
127
+ |----------|--------|-------------|
128
+ | `/documents/pdf` | GET | List all PDFs |
129
+ | `/documents/pdf/{id}` | DELETE | Delete PDF document |
130
+ | `/delete/{doc_id}` | DELETE | Delete document |
131
+ | `/document/{doc_id}` | GET | Get document by ID |
132
+ | `/history` | GET | Get chat history |
133
+ | `/stats` | GET | Collection statistics |
134
+ | `/` | GET | Health check + API docs |
135
+
136
+ ---
137
+
138
+ ## Use Cases & Recommendations
139
+
140
+ ### Case 1: Text-Only Guide PDF
141
+
142
+ **Scenario:** FAQ, policy document, text guide
143
+
144
+ **Solution:** `/upload-pdf`
145
+
146
+ ```bash
147
+ curl -X POST "http://localhost:8000/upload-pdf" \
148
149
+ -F "title=FAQ"
150
+ ```
151
+
152
+ ### Case 2: Guide PDF with Images ⭐ (Your Case)
+
+ **Scenario:** A user guide with screenshots, or a tutorial with diagrams
155
+
156
+ **Solution:** `/upload-pdf-multimodal`
157
+
158
+ ```bash
159
+ curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
160
+ -F "file=@user_guide_with_images.pdf" \
161
+ -F "title=User Guide" \
162
+ -F "category=guide"
163
+ ```
164
+
165
+ **Benefits:**
166
+ - ✓ Extract text + image URLs
167
+ - ✓ Link images to text chunks
+ - ✓ The chatbot returns images in its response
169
+ - ✓ Visual context for users
170
+
171
+ ### Case 3: Multiple Social Media Posts
172
+
173
+ **Scenario:** Index many posts with texts and images
174
+
175
+ **Solution:** `/index` with multiple inputs
176
+
177
+ ```python
178
+ data = {
179
+ 'id': 'post123',
180
+ 'texts': ['Post text 1', 'Post text 2', ...], # Max 10
181
+ }
182
+ files = [
183
+ ('images', open('img1.jpg', 'rb')),
184
+ ('images', open('img2.jpg', 'rb')), # Max 10
185
+ ]
186
+ requests.post('http://localhost:8000/index', data=data, files=files)
187
+ ```
188
+
189
+ ### Case 4: Complex Queries
190
+
191
+ **Scenario:** Complex questions that require high accuracy
192
+
193
+ **Solution:** Advanced RAG with full options
194
+
195
+ ```python
196
+ {
197
+ 'message': 'Complex question',
198
+ 'use_rag': True,
199
+ 'use_advanced_rag': True,
200
+ 'use_reranking': True,
201
+ 'use_compression': True,
202
+ 'score_threshold': 0.5,
203
+ 'top_k': 5
204
+ }
205
+ ```
206
+
207
+ ---
208
+
209
+ ## Suggested Workflow
+
+ ### Initial Setup
+
+ 1. **Create the user guide PDF**
+ - Use the template: `chatbot_guide_template.md`
+ - Customize the content for your system
+ - Add image URLs (screenshots, diagrams)
+ - Convert to PDF: `pandoc template.md -o guide.pdf`
218
+
219
+ 2. **Upload PDF**
220
+ ```bash
221
+ curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
222
+ -F "file=@chatbot_user_guide.pdf" \
223
+ -F "title=Hướng dẫn sử dụng ChatbotRAG" \
224
+ -F "category=user_guide"
225
+ ```
226
+
227
+ 3. **Verify**
228
+ ```bash
229
+ curl http://localhost:8000/documents/pdf
230
+ # Check "type": "multimodal_pdf" and "total_images"
231
+ ```
232
+
233
+ ### Daily Usage
+
+ 1. **Chat with the user**
236
+ ```python
237
+ response = requests.post('http://localhost:8000/chat', json={
238
+ 'message': user_question,
239
+ 'use_rag': True,
240
+ 'use_advanced_rag': True,
241
+ 'hf_token': 'your_token'
242
+ })
243
+ ```
244
+
245
+ 2. **Display response + images**
246
+ ```python
247
+ # Text answer
248
+ print(response.json()['response'])
249
+
250
+ # Images (if any)
251
+ for ctx in response.json()['context_used']:
252
+ if ctx['metadata'].get('has_images'):
253
+ for url in ctx['metadata']['image_urls']:
254
+ # Display image in your UI
255
+ print(f"Image: {url}")
256
+ ```
257
+
258
+ ### Updating Content
+
+ 1. **Update the PDF** - edit and re-export
+ 2. **Delete the old PDF**
262
+ ```bash
263
+ curl -X DELETE http://localhost:8000/documents/pdf/old_doc_id
264
+ ```
265
+ 3. **Upload the new PDF**
266
+ ```bash
267
+ curl -X POST http://localhost:8000/upload-pdf-multimodal -F "file=@new_guide.pdf"
268
+ ```
269
+
270
+ ---
271
+
272
+ ## Performance Tips
273
+
274
+ ### 1. Chunking
275
+
276
+ **Default:**
277
+ - chunk_size: 500 words
278
+ - chunk_overlap: 50 words
279
+
280
+ **Tuning:**
281
+ ```python
282
+ # In multimodal_pdf_parser.py
283
+ parser = MultimodalPDFParser(
284
+ chunk_size=400, # Shorter for faster retrieval
285
+ chunk_overlap=40,
286
+ min_chunk_size=50
287
+ )
288
+ ```
289
+
290
+ ### 2. Retrieval
291
+
292
+ **Good settings:**
293
+ ```python
294
+ {
295
+ 'top_k': 5, # 3-7 is optimal
296
+ 'score_threshold': 0.5, # 0.4-0.6 is good
297
+ 'use_reranking': True, # Always enable
298
+ 'use_compression': True # Keeps context relevant
299
+ }
300
+ ```
301
+
302
+ ### 3. LLM
303
+
304
+ **For factual answers:**
305
+ ```python
306
+ {
307
+ 'temperature': 0.3, # Low for accuracy
308
+ 'max_tokens': 512, # Concise answers
309
+ 'top_p': 0.9
310
+ }
311
+ ```
312
+
313
+ ---
314
+
315
+ ## Troubleshooting
316
+
317
+ ### Issue 1: Images are not detected
+
+ **Solution:**
+ - Verify the PDF contains image URLs (http://, https://)
+ - Check the format: markdown `![](url)` or HTML `<img src>`
+ - Test the regex:
323
+ ```python
324
+ from multimodal_pdf_parser import MultimodalPDFParser
325
+ parser = MultimodalPDFParser()
326
+ urls = parser.extract_image_urls("![](https://example.com/img.png)")
327
+ print(urls) # Should return ['https://example.com/img.png']
328
+ ```
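A regex of the kind `extract_image_urls` is described to use can be sketched like this. The pattern below is an assumption for illustration; the real pattern in `multimodal_pdf_parser.py` may differ:

```python
import re

# Sketch: pull http(s) image URLs out of markdown ![](...) and HTML <img src="...">
IMG_RE = re.compile(r'!\[[^\]]*\]\((https?://[^)\s]+)\)|<img[^>]+src="(https?://[^"]+)"')

def extract_image_urls(text):
    # findall returns (markdown_url, html_url) tuples; exactly one side is non-empty
    return [a or b for a, b in IMG_RE.findall(text)]

print(extract_image_urls('![](https://example.com/img.png) <img src="https://x.com/a.jpg">'))
# ['https://example.com/img.png', 'https://x.com/a.jpg']
```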
329
+
330
+ ### Issue 2: The chatbot cannot find the information
331
+
332
+ **Solution:**
333
+ - Lower score_threshold: `0.3-0.5`
334
+ - Increase top_k: `5-10`
335
+ - Enable Advanced RAG
336
+ - Rephrase question
337
+
338
+ ### Issue 3: Responses are too slow
+
+ **Solution:**
+ - Reduce top_k
+ - Disable compression if it is not needed
+ - Use basic RAG instead of advanced RAG for simple queries
344
+
345
+ ---
346
+
347
+ ## Next Steps
348
+
349
+ ### Immediate (Now)
+
+ 1. ✓ The system is ready!
+ 2. Create your guide PDF
+ 3. Upload it via `/upload-pdf-multimodal`
+ 4. Test with real questions
+
+ ### Short Term (1-2 weeks)
+
+ 1. Collect user feedback
+ 2. Fine-tune parameters (top_k, threshold)
+ 3. Add more PDFs (FAQ, tutorials, etc.)
+ 4. Monitor the chat history to improve content
362
+
363
+ ### Long Term (Later)
364
+
365
+ 1. **Hybrid Search với BM25**
366
+ - Combine dense + sparse retrieval
367
+ - Better for keyword queries
368
+
369
+ 2. **Cross-Encoder Reranking**
370
+ - Replace embedding similarity
371
+ - More accurate ranking
372
+
373
+ 3. **Image Processing**
374
+ - Download và process actual images
375
+ - Use Jina CLIP for image embeddings
376
+ - True multimodal embeddings (text + image vectors)
377
+
378
+ 4. **RAG-Anything Integration** (if needed)
379
+ - For complex PDFs with tables, charts
380
+ - Vision encoder for embedded images
381
+ - Advanced document understanding
382
+
383
+ ---
384
+
385
+ ## Comparison Matrix
386
+
387
+ | Approach | Text | Images | URLs | Complexity | Your Case |
388
+ |----------|------|--------|------|------------|-----------|
389
+ | Basic RAG | ✓ | ✗ | ✗ | Low | ✗ |
390
+ | PDF Parser | ✓ | ✗ | ✗ | Low | ✗ |
391
+ | **Multimodal PDF** | ✓ | ✗ | ✓ | **Medium** | **✓** |
392
+ | RAG-Anything | ✓ | ✓ | ✓ | High | Overkill |
393
+
394
+ **Recommendation:** **Multimodal PDF** is a perfect fit for your case!
395
+
396
+ ---
397
+
398
+ ## Conclusion
+
+ ### What You Have
+
+ ✅ **Multiple Inputs**: Index 10 texts + 10 images
+ ✅ **Advanced RAG**: Query expansion, reranking, compression
+ ✅ **PDF Support**: Parse and index PDFs
+ ✅ **Multimodal PDF**: Extract text + image URLs and link them together
+ ✅ **Complete Documentation**: Guides, examples, troubleshooting
+
+ ### What's Next?
+
+ 1. **Create a guide PDF** with your own content (including image URLs)
+ 2. **Upload** it via `/upload-pdf-multimodal`
+ 3. **Test** with real questions
+ 4. **Iterate** - fine-tune based on feedback
414
+
415
+ ### Files to Read
+
+ **For PDFs with images (your case):**
418
+ - [MULTIMODAL_PDF_GUIDE.md](MULTIMODAL_PDF_GUIDE.md) ⭐⭐⭐
419
+ - [PDF_RAG_GUIDE.md](PDF_RAG_GUIDE.md)
420
+
421
+ **For Advanced RAG:**
422
+ - [ADVANCED_RAG_GUIDE.md](ADVANCED_RAG_GUIDE.md)
423
+
424
+ **Quick Start:**
425
+ - [QUICK_START_PDF.md](QUICK_START_PDF.md)
426
+
427
+ ---
428
+
429
+ **Your system is now very powerful! Just upload a PDF and start chatting! 🚀📄🤖**
advanced_rag.py ADDED
@@ -0,0 +1,301 @@
1
+ """
2
+ Advanced RAG techniques for improved retrieval and generation
3
+ Includes: Query Expansion, Reranking, Contextual Compression, Hybrid Search
4
+ """
5
+
6
+ from typing import List, Dict, Optional, Tuple
7
+ import numpy as np
8
+ from dataclasses import dataclass
9
+ import re
10
+
11
+
12
+ @dataclass
13
+ class RetrievedDocument:
14
+ """Document retrieved from vector database"""
15
+ id: str
16
+ text: str
17
+ confidence: float
18
+ metadata: Dict
19
+
20
+
21
+ class AdvancedRAG:
22
+ """Advanced RAG system with modern techniques"""
23
+
24
+ def __init__(self, embedding_service, qdrant_service):
25
+ self.embedding_service = embedding_service
26
+ self.qdrant_service = qdrant_service
27
+
28
+ def expand_query(self, query: str) -> List[str]:
29
+ """
30
+ Expand query with related terms and variations
31
+ Simple rule-based expansion for Vietnamese queries
32
+ """
33
+ queries = [query]
34
+
35
+ # Add query variations
36
+ # Remove question words for alternative search
37
+ question_words = ['ai', 'gì', 'nào', 'đâu', 'khi nào', 'như thế nào',
38
+ 'tại sao', 'có', 'là', 'được', 'không']
39
+
40
+ query_lower = query.lower()
41
+ for qw in question_words:
42
+ if qw in query_lower:
43
+ variant = query_lower.replace(qw, '').strip()
44
+ if variant and variant != query_lower:
45
+ queries.append(variant)
46
+
47
+ # Extract key nouns/phrases (simple approach)
48
+ words = query.split()
49
+ if len(words) > 3:
50
+ # Take important words (skip first question word)
51
+ key_phrases = ' '.join(words[1:]) if words[0].lower() in question_words else ' '.join(words[:3])
52
+ if key_phrases not in queries:
53
+ queries.append(key_phrases)
54
+
55
+ return queries[:3] # Return top 3 variations
56
+
57
+ def multi_query_retrieval(
58
+ self,
59
+ query: str,
60
+ top_k: int = 5,
61
+ score_threshold: float = 0.5
62
+ ) -> List[RetrievedDocument]:
63
+ """
64
+ Retrieve documents using multiple query variations
65
+ Combines results from all query variations
66
+ """
67
+ expanded_queries = self.expand_query(query)
68
+
69
+ all_results = {} # Use dict to deduplicate by doc_id
70
+
71
+ for q in expanded_queries:
72
+ # Generate embedding for each query variant
73
+ query_embedding = self.embedding_service.encode_text(q)
74
+
75
+ # Search in Qdrant
76
+ results = self.qdrant_service.search(
77
+ query_embedding=query_embedding,
78
+ limit=top_k,
79
+ score_threshold=score_threshold
80
+ )
81
+
82
+ # Add to results (keep highest score for duplicates)
83
+ for result in results:
84
+ doc_id = result["id"]
85
+ if doc_id not in all_results or result["confidence"] > all_results[doc_id].confidence:
86
+ all_results[doc_id] = RetrievedDocument(
87
+ id=doc_id,
88
+ text=result["metadata"].get("text", ""),
89
+ confidence=result["confidence"],
90
+ metadata=result["metadata"]
91
+ )
92
+
93
+ # Sort by confidence and return top_k
94
+ sorted_results = sorted(all_results.values(), key=lambda x: x.confidence, reverse=True)
95
+ return sorted_results[:top_k]
96
+
97
+ def rerank_documents(
98
+ self,
99
+ query: str,
100
+ documents: List[RetrievedDocument],
101
+ use_cross_encoder: bool = False
102
+ ) -> List[RetrievedDocument]:
103
+ """
104
+ Rerank documents based on semantic similarity
105
+ Simple reranking using embedding similarity (can be upgraded to cross-encoder)
106
+ """
107
+ if not documents:
108
+ return documents
109
+
110
+ # Simple reranking: recalculate similarity with original query
111
+ query_embedding = self.embedding_service.encode_text(query)
112
+
113
+ reranked = []
114
+ for doc in documents:
115
+ # Get document embedding
116
+ doc_embedding = self.embedding_service.encode_text(doc.text)
117
+
118
+ # Calculate cosine similarity (explicitly normalized, so the score is
+ # correct even if the embeddings are not unit-length)
+ q_vec = query_embedding.flatten()
+ d_vec = doc_embedding.flatten()
+ similarity = np.dot(q_vec, d_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec) + 1e-8)
120
+
+            # Combine with original confidence (weighted average)
+            new_score = 0.6 * similarity + 0.4 * doc.confidence
+
+            reranked.append(RetrievedDocument(
+                id=doc.id,
+                text=doc.text,
+                confidence=float(new_score),
+                metadata=doc.metadata
+            ))
+
+        # Sort by new score
+        reranked.sort(key=lambda x: x.confidence, reverse=True)
+        return reranked
+
+    def compress_context(
+        self,
+        query: str,
+        documents: List[RetrievedDocument],
+        max_tokens: int = 500
+    ) -> List[RetrievedDocument]:
+        """
+        Compress context to most relevant parts
+        Remove redundant information and keep only relevant sentences
+        """
+        compressed_docs = []
+
+        for doc in documents:
+            # Split into sentences
+            sentences = self._split_sentences(doc.text)
+
+            # Score each sentence based on relevance to query
+            scored_sentences = []
+            query_words = set(query.lower().split())
+
+            for sent in sentences:
+                sent_words = set(sent.lower().split())
+                # Simple relevance: word overlap
+                overlap = len(query_words & sent_words)
+                if overlap > 0:
+                    scored_sentences.append((sent, overlap))
+
+            # Sort by relevance and take top sentences
+            scored_sentences.sort(key=lambda x: x[1], reverse=True)
+
+            # Reconstruct compressed text (up to max_tokens)
+            compressed_text = ""
+            word_count = 0
+            for sent, score in scored_sentences:
+                sent_words = len(sent.split())
+                if word_count + sent_words <= max_tokens:
+                    compressed_text += sent + " "
+                    word_count += sent_words
+                else:
+                    break
+
+            # If nothing selected, take original first part
+            if not compressed_text.strip():
+                compressed_text = doc.text[:max_tokens * 5]  # Rough estimate
+
+            compressed_docs.append(RetrievedDocument(
+                id=doc.id,
+                text=compressed_text.strip(),
+                confidence=doc.confidence,
+                metadata=doc.metadata
+            ))
+
+        return compressed_docs
+
+    def _split_sentences(self, text: str) -> List[str]:
+        """Split text into sentences (Vietnamese-aware)"""
+        # Simple sentence splitter
+        sentences = re.split(r'[.!?]+', text)
+        return [s.strip() for s in sentences if s.strip()]
+
+    def hybrid_rag_pipeline(
+        self,
+        query: str,
+        top_k: int = 5,
+        score_threshold: float = 0.5,
+        use_reranking: bool = True,
+        use_compression: bool = True,
+        max_context_tokens: int = 500
+    ) -> Tuple[List[RetrievedDocument], Dict]:
+        """
+        Complete advanced RAG pipeline
+        1. Multi-query retrieval
+        2. Reranking
+        3. Contextual compression
+        """
+        stats = {
+            "original_query": query,
+            "expanded_queries": [],
+            "initial_results": 0,
+            "after_rerank": 0,
+            "after_compression": 0
+        }
+
+        # Step 1: Multi-query retrieval
+        expanded_queries = self.expand_query(query)
+        stats["expanded_queries"] = expanded_queries
+
+        documents = self.multi_query_retrieval(
+            query=query,
+            top_k=top_k * 2,  # Get more candidates for reranking
+            score_threshold=score_threshold
+        )
+        stats["initial_results"] = len(documents)
+
+        # Step 2: Reranking (optional)
+        if use_reranking and documents:
+            documents = self.rerank_documents(query, documents)
+            documents = documents[:top_k]  # Keep top_k after reranking
+            stats["after_rerank"] = len(documents)
+
+        # Step 3: Contextual compression (optional)
+        if use_compression and documents:
+            documents = self.compress_context(
+                query=query,
+                documents=documents,
+                max_tokens=max_context_tokens
+            )
+            stats["after_compression"] = len(documents)
+
+        return documents, stats
+
+    def format_context_for_llm(
+        self,
+        documents: List[RetrievedDocument],
+        include_metadata: bool = True
+    ) -> str:
+        """
+        Format retrieved documents into context string for LLM
+        Uses better structure for improved LLM understanding
+        """
+        if not documents:
+            return ""
+
+        context_parts = ["RELEVANT CONTEXT:\n"]
+
+        for i, doc in enumerate(documents, 1):
+            context_parts.append(f"\n--- Document {i} (Relevance: {doc.confidence:.2%}) ---")
+            context_parts.append(doc.text)
+
+            if include_metadata and doc.metadata:
+                # Add useful metadata
+                meta_str = []
+                for key, value in doc.metadata.items():
+                    if key not in ['text', 'texts'] and value:
+                        meta_str.append(f"{key}: {value}")
+                if meta_str:
+                    context_parts.append(f"[Metadata: {', '.join(meta_str)}]")
+
+        context_parts.append("\n--- End of Context ---\n")
+        return "\n".join(context_parts)
+
+    def build_rag_prompt(
+        self,
+        query: str,
+        context: str,
+        system_message: str = "You are a helpful AI assistant."
+    ) -> str:
+        """
+        Build optimized RAG prompt for LLM
+        Uses best practices for prompt engineering
+        """
+        prompt_template = f"""{system_message}
+
+{context}
+
+INSTRUCTIONS:
+1. Answer the user's question using ONLY the information provided in the context above
+2. If the context doesn't contain relevant information, say "Tôi không tìm thấy thông tin liên quan trong dữ liệu."
+3. Cite relevant parts of the context when answering
+4. Be concise and accurate
+5. Answer in Vietnamese if the question is in Vietnamese
+
+USER QUESTION: {query}
+
+YOUR ANSWER:"""
+
+        return prompt_template
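The word-overlap compression step above can be exercised in isolation. A minimal sketch with hypothetical standalone names (`split_sentences`, `compress`) mirroring `_split_sentences` and the sentence-scoring loop in `compress_context`:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter, same regex as _split_sentences above
    return [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]

def compress(query: str, text: str, max_tokens: int = 50) -> str:
    # Score each sentence by word overlap with the query, then keep the
    # best-scoring sentences until the word budget is exhausted
    query_words = set(query.lower().split())
    scored = []
    for sent in split_sentences(text):
        overlap = len(query_words & set(sent.lower().split()))
        if overlap > 0:
            scored.append((sent, overlap))
    scored.sort(key=lambda x: x[1], reverse=True)
    kept, used = [], 0
    for sent, _ in scored:
        n = len(sent.split())
        if used + n <= max_tokens:
            kept.append(sent)
            used += n
    return " ".join(kept)
```

Note the trade-off this inherits from the class: sentences with zero lexical overlap are dropped entirely, which is why `compress_context` falls back to the start of the original text when nothing matches.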
app.py ADDED
@@ -0,0 +1,47 @@
+ """
+ Hugging Face Spaces compatible app
+ """
+ import os
+ import gradio as gr
+ from main import app as fastapi_app
+
+ # Gradio wrapper for Hugging Face Spaces
+ def create_gradio_interface():
+     """
+     Create the Gradio interface for deployment on Hugging Face Spaces
+     """
+     with gr.Blocks(title="Event Social Media Embeddings API") as demo:
+         gr.Markdown("""
+         # 🔍 Event Social Media Embeddings API
+
+         API for multimodal embeddings and search (text + images) with **Jina CLIP v2** + **Qdrant Cloud**
+
+         ## 🌟 Features:
+         - ✅ Multimodal: Text + Image embeddings
+         - ✅ Vietnamese: fully supported
+         - ✅ High Performance: ONNX + HNSW
+         - ✅ Cloud: Qdrant Cloud
+
+         ## 📡 API Endpoints:
+         - `POST /index` - Index data
+         - `POST /search` - Hybrid search
+         - `POST /search/text` - Text search
+         - `POST /search/image` - Image search
+
+         ### 🔗 API Docs:
+         Visit `/docs` for the full API documentation
+         """)
+
+         gr.Markdown("### API is running at the `/docs` endpoint")
+
+     return demo
+
+ # Mount FastAPI app
+ demo = create_gradio_interface()
+
+ # Wrap FastAPI with Gradio
+ app = gr.mount_gradio_app(fastapi_app, demo, path="/")
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
batch_index_pdfs.py ADDED
@@ -0,0 +1,151 @@
+ """
+ Batch script to index PDF files into RAG knowledge base
+ Usage: python batch_index_pdfs.py <pdf_directory> [options]
+ """
+
+ import os
+ import sys
+ from pathlib import Path
+ from pymongo import MongoClient
+ from embedding_service import JinaClipEmbeddingService
+ from qdrant_service import QdrantVectorService
+ from pdf_parser import PDFIndexer
+
+
+ def index_pdf_directory(
+     pdf_dir: str,
+     category: str = "user_guide",
+     force: bool = False
+ ):
+     """
+     Index all PDF files in a directory
+
+     Args:
+         pdf_dir: Directory containing PDF files
+         category: Category for the PDFs (default: "user_guide")
+         force: Force reindex even if already indexed (default: False)
+     """
+     print("="*60)
+     print("PDF Batch Indexer")
+     print("="*60)
+
+     # Initialize services (same as main.py)
+     print("\n[1/5] Initializing services...")
+     embedding_service = JinaClipEmbeddingService(model_path="jinaai/jina-clip-v2")
+
+     collection_name = os.getenv("COLLECTION_NAME", "event_social_media")
+     qdrant_service = QdrantVectorService(
+         collection_name=collection_name,
+         vector_size=embedding_service.get_embedding_dimension()
+     )
+
+     # MongoDB
+     mongodb_uri = os.getenv("MONGODB_URI", "mongodb+srv://truongtn7122003:[email protected]/")
+     mongo_client = MongoClient(mongodb_uri)
+     db = mongo_client[os.getenv("MONGODB_DB_NAME", "chatbot_rag")]
+     documents_collection = db["documents"]
+
+     # Initialize PDF indexer
+     pdf_indexer = PDFIndexer(
+         embedding_service=embedding_service,
+         qdrant_service=qdrant_service,
+         documents_collection=documents_collection
+     )
+     print("✓ Services initialized")
+
+     # Find all PDF files
+     print(f"\n[2/5] Scanning directory: {pdf_dir}")
+     pdf_files = list(Path(pdf_dir).glob("*.pdf"))
+
+     if not pdf_files:
+         print("✗ No PDF files found in directory")
+         return
+
+     print(f"✓ Found {len(pdf_files)} PDF file(s)")
+
+     # Index each PDF
+     print(f"\n[3/5] Indexing PDFs...")
+     indexed_count = 0
+     skipped_count = 0
+     error_count = 0
+
+     for i, pdf_path in enumerate(pdf_files, 1):
+         print(f"\n--- [{i}/{len(pdf_files)}] Processing: {pdf_path.name} ---")
+
+         # Generate document ID
+         doc_id = f"pdf_{pdf_path.stem}"
+
+         # Check if already indexed
+         if not force:
+             existing = documents_collection.find_one({"document_id": doc_id})
+             if existing:
+                 print(f"⊘ Already indexed (use --force to reindex)")
+                 skipped_count += 1
+                 continue
+
+         try:
+             # Index PDF
+             metadata = {
+                 'title': pdf_path.stem.replace('_', ' ').title(),
+                 'category': category,
+                 'source_file': str(pdf_path)
+             }
+
+             result = pdf_indexer.index_pdf(
+                 pdf_path=str(pdf_path),
+                 document_id=doc_id,
+                 document_metadata=metadata
+             )
+
+             print(f"✓ Indexed: {result['chunks_indexed']} chunks")
+             indexed_count += 1
+
+         except Exception as e:
+             print(f"✗ Error: {str(e)}")
+             error_count += 1
+
+     # Summary
+     print("\n" + "="*60)
+     print("SUMMARY")
+     print("="*60)
+     print(f"Total PDFs found: {len(pdf_files)}")
+     print(f"✓ Successfully indexed: {indexed_count}")
+     print(f"⊘ Skipped (already indexed): {skipped_count}")
+     print(f"✗ Errors: {error_count}")
+
+     if indexed_count > 0:
+         print(f"\n✓ Knowledge base updated successfully!")
+         print(f"You can now chat with your chatbot about the content in these PDFs.")
+
+
+ def main():
+     """Main entry point"""
+     if len(sys.argv) < 2:
+         print("Usage: python batch_index_pdfs.py <pdf_directory> [--category=<category>] [--force]")
+         print("\nExample:")
+         print("  python batch_index_pdfs.py ./docs/guides")
+         print("  python batch_index_pdfs.py ./docs/guides --category=user_guide --force")
+         sys.exit(1)
+
+     pdf_dir = sys.argv[1]
+
+     if not os.path.isdir(pdf_dir):
+         print(f"Error: Directory not found: {pdf_dir}")
+         sys.exit(1)
+
+     # Parse options
+     category = "user_guide"
+     force = False
+
+     for arg in sys.argv[2:]:
+         if arg.startswith("--category="):
+             category = arg.split("=")[1]
+         elif arg == "--force":
+             force = True
+
+     # Index PDFs
+     index_pdf_directory(pdf_dir, category=category, force=force)
+
+
+ if __name__ == "__main__":
+     main()
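The lightweight flag handling in `main()` (defaults, `--category=<name>`, bare `--force`) can be sketched standalone; `parse_options` is a hypothetical helper name, not part of the script:

```python
def parse_options(args: list[str]) -> tuple[str, bool]:
    # Same defaults and flags as main(): --category=<name> and --force
    category, force = "user_guide", False
    for arg in args:
        if arg.startswith("--category="):
            category = arg.split("=")[1]
        elif arg == "--force":
            force = True
    return category, force
```

For anything beyond these two flags, the stdlib `argparse` module would be the more idiomatic choice.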
chatbot_guide_template.md ADDED
@@ -0,0 +1,369 @@
+ # ChatbotRAG User Guide
+
+ *Version 2.0 - October 2025*
+
+ ---
+
+ ## 1. Introduction
+
+ ### What is ChatbotRAG?
+
+ ChatbotRAG is an intelligent chatbot system that uses RAG (Retrieval-Augmented Generation) to answer questions based on your knowledge base.
+
+ ### Key Features
+
+ - **Multimodal Search**: Search by text and images
+ - **Advanced RAG**: Query expansion, reranking, context compression
+ - **PDF Support**: Upload PDFs and chat about their content
+ - **Multiple Inputs**: Index several texts and images at once (up to 10 of each)
+ - **Chat History**: Chat history is stored for tracking
+
+ ---
+
+ ## 2. Quick Start
+
+ ### Step 1: Start the server
+
+ ```bash
+ cd ChatbotRAG
+ python main.py
+ ```
+
+ The server runs at: `http://localhost:8000`
+
+ ### Step 2: Open the API documentation
+
+ Open a browser and visit:
+ - API Docs: `http://localhost:8000/docs`
+ - ReDoc: `http://localhost:8000/redoc`
+
+ ### Step 3: Test with a simple question
+
+ ```bash
+ curl -X POST "http://localhost:8000/chat" \
+   -H "Content-Type: application/json" \
+   -d '{"message": "Xin chào, bạn là ai?"}'
+ ```
+
+ ---
+
+ ## 3. Indexing Data
+
+ ### 3.1. Index Plain Text
+
+ ```bash
+ curl -X POST "http://localhost:8000/index" \
+   -F "id=doc1" \
+   -F "texts=Đây là text nội dung 1" \
+   -F "texts=Đây là text nội dung 2"
+ ```
+
+ ### 3.2. Index With Images
+
+ ```bash
+ curl -X POST "http://localhost:8000/index" \
+   -F "id=event123" \
+   -F "texts=Sự kiện âm nhạc tại Hà Nội" \
+ ```
+
+ **Note**: At most 10 texts and 10 images per request.
+
+ ### 3.3. Upload a PDF
+
+ To upload a PDF document into the system:
+
+ ```bash
+ curl -X POST "http://localhost:8000/upload-pdf" \
+   -F "file=@user_guide.pdf" \
+   -F "title=Hướng dẫn sử dụng" \
+   -F "category=user_guide"
+ ```
+
+ After upload, the chatbot can answer questions about the PDF's content.
+
+ ---
+
89
+
90
+ ### 4.1. Search Bằng Text
91
+
92
+ ```bash
93
+ curl -X POST "http://localhost:8000/search/text" \
94
+ -F "text=sự kiện âm nhạc" \
95
+ -F "limit=5"
96
+ ```
97
+
98
+ ### 4.2. Search Bằng Image
99
+
100
+ ```bash
101
+ curl -X POST "http://localhost:8000/search/image" \
102
+ -F "image=@query_image.jpg" \
103
+ -F "limit=5"
104
+ ```
105
+
106
+ ### 4.3. Hybrid Search (Text + Image)
107
+
108
+ ```bash
109
+ curl -X POST "http://localhost:8000/search" \
110
+ -F "text=festival music" \
111
112
+ -F "text_weight=0.6" \
113
+ -F "image_weight=0.4"
114
+ ```
115
+
116
+ ---
117
+
118
+ ## 5. Chat Với Chatbot
119
+
120
+ ### 5.1. Chat Cơ Bản (Không RAG)
121
+
122
+ ```python
123
+ import requests
124
+
125
+ response = requests.post('http://localhost:8000/chat', json={
126
+ 'message': 'Xin chào!',
127
+ 'use_rag': False,
128
+ 'hf_token': 'your_huggingface_token'
129
+ })
130
+
131
+ print(response.json()['response'])
132
+ ```
133
+
134
+ ### 5.2. Chat Với RAG (Recommended)
135
+
136
+ ```python
137
+ response = requests.post('http://localhost:8000/chat', json={
138
+ 'message': 'Festival âm nhạc diễn ra khi nào?',
139
+ 'use_rag': True,
140
+ 'use_advanced_rag': True,
141
+ 'top_k': 5,
142
+ 'hf_token': 'your_token'
143
+ })
144
+
145
+ result = response.json()
146
+ print("Answer:", result['response'])
147
+ print("Sources:", result['context_used'])
148
+ ```
149
+
150
+ ### 5.3. Advanced RAG Options
151
+
152
+ ```python
153
+ response = requests.post('http://localhost:8000/chat', json={
154
+ 'message': 'Câu hỏi của bạn',
155
+ 'use_rag': True,
156
+ 'use_advanced_rag': True,
157
+
158
+ # Advanced RAG settings
159
+ 'use_query_expansion': True, # Mở rộng câu hỏi
160
+ 'use_reranking': True, # Rerank kết quả
161
+ 'use_compression': True, # Nén context
162
+ 'score_threshold': 0.5, # Ngưỡng relevance (0-1)
163
+ 'top_k': 5, # Số documents retrieve
164
+
165
+ # LLM settings
166
+ 'max_tokens': 512,
167
+ 'temperature': 0.7,
168
+ 'hf_token': 'your_token'
169
+ })
170
+ ```
171
+
172
+ ---
173
+
174
+ ## 6. Quản Lý Documents
175
+
176
+ ### 6.1. Xem Danh Sách Documents
177
+
178
+ ```bash
179
+ # Xem stats collection
180
+ curl http://localhost:8000/stats
181
+
182
+ # Xem PDFs
183
+ curl http://localhost:8000/documents/pdf
184
+ ```
185
+
186
+ ### 6.2. Get Document By ID
187
+
188
+ ```bash
189
+ curl http://localhost:8000/document/doc123
190
+ ```
191
+
192
+ ### 6.3. Xóa Document
193
+
194
+ ```bash
195
+ curl -X DELETE http://localhost:8000/delete/doc123
196
+ ```
197
+
198
+ ### 6.4. Xóa PDF Document
199
+
200
+ ```bash
201
+ curl -X DELETE http://localhost:8000/documents/pdf/pdf_20251029_143022
202
+ ```
203
+
204
+ ---
205
+
206
+ ## 7. Frequently Asked Questions (FAQ)
+
+ ### Q1: How do I upload a PDF into the system?
+
+ **A:** Use the `/upload-pdf` endpoint:
+
+ ```bash
+ curl -X POST "http://localhost:8000/upload-pdf" \
+   -F "file=@your_file.pdf" \
+   -F "title=Tên tài liệu"
+ ```
+
+ ### Q2: The chatbot can't find relevant information?
+
+ **A:** Try the following:
+ 1. Lower `score_threshold` (0.3 - 0.5)
+ 2. Raise `top_k` (5-10)
+ 3. Use `use_advanced_rag=True`
+ 4. Rephrase the question more clearly
+
+ ### Q3: How do I improve the chatbot's accuracy?
+
+ **A:**
+ - Enable Advanced RAG: `use_advanced_rag=True`
+ - Enable all RAG features: `use_reranking=True`, `use_compression=True`
+ - Index more documents with detailed content
+ - Use appropriate metadata when indexing
+
+ ### Q4: What is the LLM token limit?
+
+ **A:** The default is `max_tokens=512`. You can raise it in the request:
+
+ ```python
+ {
+     'message': 'Your question',
+     'max_tokens': 1024,  # Increased
+     'hf_token': 'your_token'
+ }
+ ```
+
+ ### Q5: How many texts/images can be uploaded at once?
+
+ **A:** At most **10 texts** and **10 images** per request to the `/index` endpoint.
+
+ ### Q6: Does the chatbot support Vietnamese?
+
+ **A:** Yes! The system uses Jina CLIP v2, which is multilingual and includes Vietnamese.
+
+ ### Q7: How do I view the chat history?
+
+ **A:**
+ ```bash
+ curl "http://localhost:8000/history?limit=10&skip=0"
+ ```
+
+ ### Q8: My PDF contains many images — is that a problem?
+
+ **A:** The system currently extracts only text from PDFs. Images inside PDFs are not processed yet. If image handling is needed, RAG-Anything could be integrated later.
+
+ ---
+
+ ## 8. API Reference
+
+ ### Main Endpoints
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/` | GET | Health check & API docs |
+ | `/index` | POST | Index texts + images (up to 10 of each) |
+ | `/search` | POST | Hybrid search (text + image) |
+ | `/search/text` | POST | Text-only search |
+ | `/search/image` | POST | Image-only search |
+ | `/chat` | POST | Chat with RAG |
+ | `/documents` | POST | Add text document |
+ | `/upload-pdf` | POST | Upload and index a PDF |
+ | `/documents/pdf` | GET | List PDFs |
+ | `/documents/pdf/{id}` | DELETE | Delete a PDF |
+ | `/history` | GET | Get chat history |
+ | `/stats` | GET | Collection statistics |
+
+ ### Request Examples
+
+ **Index with multiple texts:**
+ ```json
+ POST /index
+ {
+   "id": "doc123",
+   "texts": ["Text 1", "Text 2", "Text 3"]
+ }
+ ```
+
+ **Chat with Advanced RAG:**
+ ```json
+ POST /chat
+ {
+   "message": "Your question",
+   "use_rag": true,
+   "use_advanced_rag": true,
+   "use_reranking": true,
+   "top_k": 5,
+   "score_threshold": 0.5,
+   "hf_token": "hf_xxxxx"
+ }
+ ```
+
+ ---
+
+ ## 9. Best Practices
+
+ ### Indexing Data
+ ✓ Split content into meaningful chunks
+ ✓ Add complete metadata (title, category, source)
+ ✓ Use the texts array for multiple paragraphs
+ ✗ Avoid indexing overly long text in a single chunk
+
+ ### Chat
+ ✓ Enable Advanced RAG for complex questions
+ ✓ Tune `top_k` and `score_threshold` appropriately
+ ✓ Use a low `temperature` (0.3-0.5) for factual answers
+ ✗ Avoid setting `score_threshold` too high (>0.8)
+
+ ### PDF
+ ✓ PDFs with a text layer (not scanned images)
+ ✓ Clear structure with headings and paragraphs
+ ✓ Concise, easy-to-understand content
+ ✗ Avoid PDFs with many complex images
+
+ ---
+
+ ## 10. Troubleshooting
+
+ ### Server won't start
+ - Check dependencies: `pip install -r requirements.txt`
+ - Check the MongoDB connection string
+ - Check the Qdrant service
+
+ ### PDF upload fails
+ - Verify the file is a valid PDF
+ - Check that the file is not corrupt
+ - Re-convert the PDF if needed
+
+ ### The chatbot answers incorrectly
+ - Check that documents have been indexed: `/stats`
+ - Try lowering `score_threshold`
+ - Enable the Advanced RAG options
+ - Check the LLM token (Hugging Face)
+
+ ### Out of memory
+ - Reduce `chunk_size` in the PDF parser
+ - Reduce `top_k` in the chat request
+ - Index fewer documents per run
+
+ ---
+
+ ## 11. Contact & Support
+
+ If you have questions or issues:
+ - Check the server logs
+ - Review the API documentation at `/docs`
+ - See the GitHub issues
+
+ ---
+
+ **Happy Chatting! 🤖**
chatbot_rag.py ADDED
@@ -0,0 +1,351 @@
+ import gradio as gr
+ from huggingface_hub import InferenceClient
+ from pymongo import MongoClient
+ from datetime import datetime
+ from typing import List, Dict
+ import numpy as np
+
+ from embedding_service import JinaClipEmbeddingService
+ from qdrant_service import QdrantVectorService
+
+
+ class ChatbotRAG:
+     """
+     RAG chatbot with:
+     - LLM: GPT-OSS-20B (Hugging Face)
+     - Embeddings: Jina CLIP v2
+     - Vector DB: Qdrant
+     - Document Store: MongoDB
+     """
+
+     def __init__(
+         self,
+         mongodb_uri: str = "mongodb+srv://truongtn7122003:[email protected]/",
+         db_name: str = "chatbot_rag",
+         collection_name: str = "documents"
+     ):
+         """
+         Initialize ChatbotRAG
+
+         Args:
+             mongodb_uri: MongoDB connection string
+             db_name: Database name
+             collection_name: Collection name for documents
+         """
+         print("Initializing ChatbotRAG...")
+
+         # MongoDB client
+         self.mongo_client = MongoClient(mongodb_uri)
+         self.db = self.mongo_client[db_name]
+         self.documents_collection = self.db[collection_name]
+         self.chat_history_collection = self.db["chat_history"]
+
+         # Embedding service (Jina CLIP v2)
+         self.embedding_service = JinaClipEmbeddingService(
+             model_path="jinaai/jina-clip-v2"
+         )
+
+         # Qdrant vector service
+         self.qdrant_service = QdrantVectorService(
+             collection_name="chatbot_rag_vectors",
+             vector_size=self.embedding_service.get_embedding_dimension()
+         )
+
+         print("✓ ChatbotRAG initialized successfully")
+
+     def add_document(self, text: str, metadata: Dict = None) -> str:
+         """
+         Add document to MongoDB and Qdrant
+
+         Args:
+             text: Document text
+             metadata: Additional metadata
+
+         Returns:
+             Document ID
+         """
+         # Save to MongoDB
+         doc_data = {
+             "text": text,
+             "metadata": metadata or {},
+             "created_at": datetime.utcnow()
+         }
+         result = self.documents_collection.insert_one(doc_data)
+         doc_id = str(result.inserted_id)
+
+         # Generate embedding
+         embedding = self.embedding_service.encode_text(text)
+
+         # Index to Qdrant
+         self.qdrant_service.index_data(
+             doc_id=doc_id,
+             embedding=embedding,
+             metadata={
+                 "text": text,
+                 "source": "user_upload",
+                 **(metadata or {})
+             }
+         )
+
+         return doc_id
+
+     def retrieve_context(self, query: str, top_k: int = 3) -> List[Dict]:
+         """
+         Retrieve relevant context from vector DB
+
+         Args:
+             query: User query
+             top_k: Number of results to retrieve
+
+         Returns:
+             List of relevant documents
+         """
+         # Generate query embedding
+         query_embedding = self.embedding_service.encode_text(query)
+
+         # Search in Qdrant
+         results = self.qdrant_service.search(
+             query_embedding=query_embedding,
+             limit=top_k,
+             score_threshold=0.5  # Only get relevant results
+         )
+
+         return results
+
+     def save_chat_history(self, user_message: str, assistant_response: str, context_used: List[Dict]):
+         """
+         Save chat interaction to MongoDB
+
+         Args:
+             user_message: User's message
+             assistant_response: Assistant's response
+             context_used: Context retrieved from RAG
+         """
+         chat_data = {
+             "user_message": user_message,
+             "assistant_response": assistant_response,
+             "context_used": context_used,
+             "timestamp": datetime.utcnow()
+         }
+         self.chat_history_collection.insert_one(chat_data)
+
+     def respond(
+         self,
+         message: str,
+         history: List[Dict[str, str]],
+         system_message: str,
+         max_tokens: int,
+         temperature: float,
+         top_p: float,
+         use_rag: bool,
+         hf_token: gr.OAuthToken,
+     ):
+         """
+         Generate response with RAG
+
+         Args:
+             message: User message
+             history: Chat history
+             system_message: System prompt
+             max_tokens: Max tokens to generate
+             temperature: Temperature for generation
+             top_p: Top-p sampling
+             use_rag: Whether to use RAG retrieval
+             hf_token: Hugging Face token
+
+         Yields:
+             Generated response
+         """
+         # Initialize LLM client
+         client = InferenceClient(token=hf_token.token, model="openai/gpt-oss-20b")
+
+         # Prepare context from RAG
+         context_text = ""
+         context_used = []
+
+         if use_rag:
+             # Retrieve relevant context
+             retrieved_docs = self.retrieve_context(message, top_k=3)
+             context_used = retrieved_docs
+
+             if retrieved_docs:
+                 context_text = "\n\n**Relevant Context:**\n"
+                 for i, doc in enumerate(retrieved_docs, 1):
+                     doc_text = doc["metadata"].get("text", "")
+                     confidence = doc["confidence"]
+                     context_text += f"\n[{i}] (Confidence: {confidence:.2f})\n{doc_text}\n"
+
+             # Add context to system message
+             system_message = f"{system_message}\n\n{context_text}\n\nPlease use the above context to answer the user's question when relevant."
+
+         # Build messages for LLM
+         messages = [{"role": "system", "content": system_message}]
+         messages.extend(history)
+         messages.append({"role": "user", "content": message})
+
+         # Generate response
+         response = ""
+
+         try:
+             for msg in client.chat_completion(
+                 messages,
+                 max_tokens=max_tokens,
+                 stream=True,
+                 temperature=temperature,
+                 top_p=top_p,
+             ):
+                 choices = msg.choices
+                 token = ""
+                 if len(choices) and choices[0].delta.content:
+                     token = choices[0].delta.content
+
+                 response += token
+                 yield response
+
+             # Save to chat history
+             self.save_chat_history(message, response, context_used)
+
+         except Exception as e:
+             error_msg = f"Error generating response: {str(e)}"
+             yield error_msg
+
213
+ # Initialize ChatbotRAG
214
+ chatbot_rag = ChatbotRAG()
215
+
216
+
217
+ def respond_wrapper(
218
+ message,
219
+ history,
220
+ system_message,
221
+ max_tokens,
222
+ temperature,
223
+ top_p,
224
+ use_rag,
225
+ hf_token,
226
+ ):
227
+ """Wrapper for Gradio ChatInterface"""
228
+ yield from chatbot_rag.respond(
229
+ message=message,
230
+ history=history,
231
+ system_message=system_message,
232
+ max_tokens=max_tokens,
233
+ temperature=temperature,
234
+ top_p=top_p,
235
+ use_rag=use_rag,
236
+ hf_token=hf_token,
237
+ )
238
+
239
+
240
+ def add_document_to_rag(text: str) -> str:
241
+ """
242
+ Add document to RAG knowledge base
243
+
244
+ Args:
245
+ text: Document text
246
+
247
+ Returns:
248
+ Success message
249
+ """
250
+ try:
251
+ doc_id = chatbot_rag.add_document(text)
252
+ return f"✓ Document added successfully! ID: {doc_id}"
253
+ except Exception as e:
254
+ return f"✗ Error adding document: {str(e)}"
255
+
256
+
257
+ # Create Gradio interface
258
+ with gr.Blocks(title="ChatbotRAG - GPT-OSS-20B + Jina CLIP v2 + MongoDB") as demo:
259
+ gr.Markdown("""
260
+ # 🤖 ChatbotRAG
261
+
262
+ **Features:**
263
+ - 💬 LLM: GPT-OSS-20B
264
+ - 🔍 Embeddings: Jina CLIP v2 (Vietnamese support)
265
+ - 📊 Vector DB: Qdrant Cloud
266
+ - 🗄️ Document Store: MongoDB
267
+
268
+ **How to use:**
269
+ 1. Add documents to knowledge base (optional)
270
+ 2. Toggle "Use RAG" to enable context retrieval
271
+ 3. Chat with the bot!
272
+ """)
273
+
274
+ with gr.Sidebar():
275
+ gr.LoginButton()
276
+
277
+ gr.Markdown("### ⚙️ Settings")
278
+
279
+ use_rag = gr.Checkbox(
280
+ label="Use RAG",
281
+ value=True,
282
+ info="Enable RAG to retrieve relevant context from knowledge base"
283
+ )
284
+
285
+ system_message = gr.Textbox(
286
+ value="You are a helpful AI assistant. Answer questions based on the provided context when available.",
287
+ label="System message",
288
+ lines=3
289
+ )
290
+
291
+ max_tokens = gr.Slider(
292
+ minimum=1,
293
+ maximum=2048,
294
+ value=512,
295
+ step=1,
296
+ label="Max new tokens"
297
+ )
298
+
299
+ temperature = gr.Slider(
300
+ minimum=0.1,
301
+ maximum=4.0,
302
+ value=0.7,
303
+ step=0.1,
304
+ label="Temperature"
305
+ )
306
+
307
+ top_p = gr.Slider(
308
+ minimum=0.1,
309
+ maximum=1.0,
310
+ value=0.95,
311
+ step=0.05,
312
+ label="Top-p (nucleus sampling)"
313
+ )
314
+
315
+ # Chat interface
316
+ chatbot = gr.ChatInterface(
317
+ respond_wrapper,
318
+ type="messages",
319
+ additional_inputs=[
320
+ system_message,
321
+ max_tokens,
322
+ temperature,
323
+ top_p,
324
+ use_rag,
325
+ ],
326
+ )
327
+
328
+ # Document management
329
+ with gr.Accordion("📚 Knowledge Base Management", open=False):
330
+ gr.Markdown("### Add Documents to Knowledge Base")
331
+
332
+ doc_text = gr.Textbox(
333
+ label="Document Text",
334
+ placeholder="Enter document text here...",
335
+ lines=5
336
+ )
337
+
338
+ add_btn = gr.Button("Add Document", variant="primary")
339
+ output_msg = gr.Textbox(label="Status", interactive=False)
340
+
341
+ add_btn.click(
342
+ fn=add_document_to_rag,
343
+ inputs=[doc_text],
344
+ outputs=[output_msg]
345
+ )
346
+
347
+ chatbot.render()
348
+
349
+
350
+ if __name__ == "__main__":
351
+ demo.launch(server_name="0.0.0.0", server_port=7860)
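The payload assembly inside `respond()` follows the standard chat-completion message shape: system prompt first (with any retrieved RAG context already folded in), then the prior turns, then the new user message. A minimal standalone sketch, with `build_messages` as a hypothetical helper name:

```python
def build_messages(system_message: str, history: list[dict], user_message: str) -> list[dict]:
    # System prompt first (RAG context is appended to it upstream),
    # then the prior conversation turns, then the new user message
    messages = [{"role": "system", "content": system_message}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return messages
```

Because `history` already uses the `{"role": ..., "content": ...}` format (Gradio's `type="messages"`), it can be spliced in unchanged.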
chatbot_rag_api.py ADDED
@@ -0,0 +1,468 @@
1
+ from fastapi import FastAPI, HTTPException, File, UploadFile, Form
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+ from typing import Optional, List, Dict
+ from pymongo import MongoClient
+ from datetime import datetime
+ import numpy as np
+ import os
+ from huggingface_hub import InferenceClient
+
+ from embedding_service import JinaClipEmbeddingService
+ from qdrant_service import QdrantVectorService
+
+
+ # Pydantic models
+ class ChatRequest(BaseModel):
+     message: str
+     use_rag: bool = True
+     top_k: int = 3
+     system_message: Optional[str] = "You are a helpful AI assistant."
+     max_tokens: int = 512
+     temperature: float = 0.7
+     top_p: float = 0.95
+     hf_token: Optional[str] = None  # Hugging Face token (optional; falls back to the env variable if omitted)
+
+
+ class ChatResponse(BaseModel):
+     response: str
+     context_used: List[Dict]
+     timestamp: str
+
+
+ class AddDocumentRequest(BaseModel):
+     text: str
+     metadata: Optional[Dict] = None
+
+
+ class AddDocumentResponse(BaseModel):
+     success: bool
+     doc_id: str
+     message: str
+
+
+ class SearchRequest(BaseModel):
+     query: str
+     top_k: int = 5
+     score_threshold: Optional[float] = 0.5
+
+
+ class SearchResponse(BaseModel):
+     results: List[Dict]
+
+
+ # Initialize FastAPI
+ app = FastAPI(
+     title="ChatbotRAG API",
+     description="API for RAG Chatbot with GPT-OSS-20B + Jina CLIP v2 + MongoDB + Qdrant",
+     version="1.0.0"
+ )
+
+ # CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],  # Allow all origins (consider restricting this in production)
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+
+ # ChatbotRAG Service
+ class ChatbotRAGService:
+     """
+     ChatbotRAG service backing the API
+     """
+
+     def __init__(
+         self,
+         mongodb_uri: str = "mongodb+srv://truongtn7122003:[email protected]/",
+         db_name: str = "chatbot_rag",
+         collection_name: str = "documents",
+         hf_token: Optional[str] = None
+     ):
+         print("Initializing ChatbotRAG Service...")
+
+         # MongoDB
+         self.mongo_client = MongoClient(mongodb_uri)
+         self.db = self.mongo_client[db_name]
+         self.documents_collection = self.db[collection_name]
+         self.chat_history_collection = self.db["chat_history"]
+
+         # Embedding service
+         self.embedding_service = JinaClipEmbeddingService(
+             model_path="jinaai/jina-clip-v2"
+         )
+
+         # Qdrant (the collection name comes from the environment, not the constructor argument)
+         qdrant_collection = os.getenv("COLLECTION_NAME", "event_social_media")
+         self.qdrant_service = QdrantVectorService(
+             collection_name=qdrant_collection,
+             vector_size=self.embedding_service.get_embedding_dimension()
+         )
+
+         # Hugging Face token (from env or passed in)
+         self.hf_token = hf_token or os.getenv("HUGGINGFACE_TOKEN")
+         if self.hf_token:
+             print("✓ Hugging Face token configured")
+         else:
+             print("⚠ No Hugging Face token - LLM generation will use placeholder")
+
+         print("✓ ChatbotRAG Service initialized")
+
+     def add_document(self, text: str, metadata: Dict = None) -> str:
+         """Add document to knowledge base"""
+         # Save to MongoDB
+         doc_data = {
+             "text": text,
+             "metadata": metadata or {},
+             "created_at": datetime.utcnow()
+         }
+         result = self.documents_collection.insert_one(doc_data)
+         doc_id = str(result.inserted_id)
+
+         # Generate embedding
+         embedding = self.embedding_service.encode_text(text)
+
+         # Index to Qdrant
+         self.qdrant_service.index_data(
+             doc_id=doc_id,
+             embedding=embedding,
+             metadata={
+                 "text": text,
+                 "source": "api",
+                 **(metadata or {})
+             }
+         )
+
+         return doc_id
+
+     def retrieve_context(self, query: str, top_k: int = 3, score_threshold: float = 0.5) -> List[Dict]:
+         """Retrieve relevant context from vector DB"""
+         # Generate query embedding
+         query_embedding = self.embedding_service.encode_text(query)
+
+         # Search in Qdrant
+         results = self.qdrant_service.search(
+             query_embedding=query_embedding,
+             limit=top_k,
+             score_threshold=score_threshold
+         )
+
+         return results
+
+     def generate_response(
+         self,
+         message: str,
+         context: List[Dict],
+         system_message: str,
+         max_tokens: int = 512,
+         temperature: float = 0.7,
+         top_p: float = 0.95,
+         hf_token: Optional[str] = None
+     ) -> str:
+         """
+         Generate response using Hugging Face LLM
+         """
+         # Build context text
+         context_text = ""
+         if context:
+             context_text = "\n\nRelevant Context:\n"
+             for i, doc in enumerate(context, 1):
+                 doc_text = doc["metadata"].get("text", "")
+                 confidence = doc["confidence"]
+                 context_text += f"\n[{i}] (Confidence: {confidence:.2f})\n{doc_text}\n"
+
+         # Add context to system message
+         system_message = f"{system_message}\n{context_text}\n\nPlease use the above context to answer the user's question when relevant."
+
+         # Use token from request or fall back to the service token
+         token = hf_token or self.hf_token
+
+         # If no token is available, return a placeholder
+         if not token:
+             return f"""[LLM Response Placeholder]
+
+ Context retrieved: {len(context)} documents
+ User question: {message}
+
+ To enable actual LLM generation:
+ 1. Set HUGGINGFACE_TOKEN environment variable, OR
+ 2. Pass hf_token in request body
+
+ Example:
+ {{
+     "message": "Your question",
+     "hf_token": "hf_xxxxxxxxxxxxx"
+ }}
+ """
+
+         # Initialize HF Inference Client
+         try:
+             client = InferenceClient(
+                 token=token,
+                 model="openai/gpt-oss-20b"
+             )
+
+             # Build messages
+             messages = [
+                 {"role": "system", "content": system_message},
+                 {"role": "user", "content": message}
+             ]
+
+             # Stream the completion and accumulate it into a single string
+             response = ""
+             for msg in client.chat_completion(
+                 messages,
+                 max_tokens=max_tokens,
+                 stream=True,
+                 temperature=temperature,
+                 top_p=top_p,
+             ):
+                 choices = msg.choices
+                 if len(choices) and choices[0].delta.content:
+                     response += choices[0].delta.content
+
+             return response
+
+         except Exception as e:
+             return f"Error generating response with LLM: {str(e)}\n\nContext was retrieved successfully, but LLM generation failed."
+
+     def save_chat_history(self, user_message: str, assistant_response: str, context_used: List[Dict]):
+         """Save chat to MongoDB"""
+         chat_data = {
+             "user_message": user_message,
+             "assistant_response": assistant_response,
+             "context_used": context_used,
+             "timestamp": datetime.utcnow()
+         }
+         self.chat_history_collection.insert_one(chat_data)
+
+     def get_stats(self) -> Dict:
+         """Get statistics"""
+         return {
+             "documents_count": self.documents_collection.count_documents({}),
+             "chat_history_count": self.chat_history_collection.count_documents({}),
+             "qdrant_info": self.qdrant_service.get_collection_info()
+         }
+
+
+ # Initialize service
+ rag_service = ChatbotRAGService()
+
+
+ # API Endpoints
+
+ @app.get("/")
+ async def root():
+     """Health check"""
+     return {
+         "status": "running",
+         "service": "ChatbotRAG API",
+         "version": "1.0.0",
+         "endpoints": {
+             "POST /chat": "Chat with RAG",
+             "POST /documents": "Add document to knowledge base",
+             "POST /search": "Search in knowledge base",
+             "GET /stats": "Get statistics",
+             "GET /history": "Get chat history"
+         }
+     }
+
+
+ @app.post("/chat", response_model=ChatResponse)
+ async def chat(request: ChatRequest):
+     """
+     Chat endpoint with RAG
+
+     Body:
+     - message: User message
+     - use_rag: Enable RAG retrieval (default: true)
+     - top_k: Number of documents to retrieve (default: 3)
+     - system_message: System prompt (optional)
+     - max_tokens: Max tokens for response (default: 512)
+     - temperature: Temperature for generation (default: 0.7)
+
+     Returns:
+     - response: Generated response
+     - context_used: Retrieved context documents
+     - timestamp: Response timestamp
+     """
+     try:
+         # Retrieve context if RAG enabled
+         context_used = []
+         if request.use_rag:
+             context_used = rag_service.retrieve_context(
+                 query=request.message,
+                 top_k=request.top_k
+             )
+
+         # Generate response
+         response = rag_service.generate_response(
+             message=request.message,
+             context=context_used,
+             system_message=request.system_message,
+             max_tokens=request.max_tokens,
+             temperature=request.temperature,
+             top_p=request.top_p,
+             hf_token=request.hf_token
+         )
+
+         # Save to history
+         rag_service.save_chat_history(
+             user_message=request.message,
+             assistant_response=response,
+             context_used=context_used
+         )
+
+         return ChatResponse(
+             response=response,
+             context_used=context_used,
+             timestamp=datetime.utcnow().isoformat()
+         )
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.post("/documents", response_model=AddDocumentResponse)
+ async def add_document(request: AddDocumentRequest):
+     """
+     Add document to knowledge base
+
+     Body:
+     - text: Document text
+     - metadata: Additional metadata (optional)
+
+     Returns:
+     - success: True/False
+     - doc_id: MongoDB document ID
+     - message: Status message
+     """
+     try:
+         doc_id = rag_service.add_document(
+             text=request.text,
+             metadata=request.metadata
+         )
+
+         return AddDocumentResponse(
+             success=True,
+             doc_id=doc_id,
+             message=f"Document added successfully with ID: {doc_id}"
+         )
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.post("/search", response_model=SearchResponse)
+ async def search(request: SearchRequest):
+     """
+     Search in knowledge base
+
+     Body:
+     - query: Search query
+     - top_k: Number of results (default: 5)
+     - score_threshold: Minimum score (default: 0.5)
+
+     Returns:
+     - results: List of matching documents
+     """
+     try:
+         results = rag_service.retrieve_context(
+             query=request.query,
+             top_k=request.top_k,
+             score_threshold=request.score_threshold
+         )
+
+         return SearchResponse(results=results)
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.get("/stats")
+ async def get_stats():
+     """
+     Get statistics
+
+     Returns:
+     - documents_count: Number of documents in MongoDB
+     - chat_history_count: Number of chat messages
+     - qdrant_info: Qdrant collection info
+     """
+     try:
+         return rag_service.get_stats()
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.get("/history")
+ async def get_history(limit: int = 10, skip: int = 0):
+     """
+     Get chat history
+
+     Query params:
+     - limit: Number of messages to return (default: 10)
+     - skip: Number of messages to skip (default: 0)
+
+     Returns:
+     - history: List of chat messages
+     """
+     try:
+         history = list(
+             rag_service.chat_history_collection
+             .find({}, {"_id": 0})
+             .sort("timestamp", -1)
+             .skip(skip)
+             .limit(limit)
+         )
+
+         # Convert datetime to string
+         for msg in history:
+             if "timestamp" in msg:
+                 msg["timestamp"] = msg["timestamp"].isoformat()
+
+         return {"history": history, "total": rag_service.chat_history_collection.count_documents({})}
+
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ @app.delete("/documents/{doc_id}")
+ async def delete_document(doc_id: str):
+     """
+     Delete document from knowledge base
+
+     Args:
+     - doc_id: Document ID (MongoDB ObjectId)
+
+     Returns:
+     - success: True/False
+     - message: Status message
+     """
+     try:
+         # Delete from MongoDB. The path parameter is a string, so convert it
+         # to an ObjectId first; matching a raw string against "_id" never succeeds.
+         from bson import ObjectId
+         result = rag_service.documents_collection.delete_one({"_id": ObjectId(doc_id)})
+
+         # Delete from Qdrant
+         if result.deleted_count > 0:
+             rag_service.qdrant_service.delete_by_id(doc_id)
+             return {"success": True, "message": f"Document {doc_id} deleted"}
+         else:
+             raise HTTPException(status_code=404, detail=f"Document {doc_id} not found")
+
+     except HTTPException:
+         raise
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(
+         app,
+         host="0.0.0.0",
+         port=8000,
+         log_level="info"
+     )
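The prompt-assembly step in `generate_response` can be sanity-checked in isolation. This sketch reproduces the "Relevant Context" formatting with hypothetical retrieval results; the function and variable names are illustrative and not part of the API:

```python
def build_context_block(context):
    """Reproduce the context block generate_response prepends to the system prompt."""
    if not context:
        return ""
    block = "\n\nRelevant Context:\n"
    for i, doc in enumerate(context, 1):
        doc_text = doc["metadata"].get("text", "")
        confidence = doc["confidence"]
        block += f"\n[{i}] (Confidence: {confidence:.2f})\n{doc_text}\n"
    return block


# Hypothetical retrieval results in the same shape the Qdrant search returns.
docs = [
    {"metadata": {"text": "Music event in Hanoi on 20/10/2025"}, "confidence": 0.91},
    {"metadata": {"text": "Venue: National Convention Center"}, "confidence": 0.74},
]
print(build_context_block(docs))
```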
embedding_service.py ADDED
@@ -0,0 +1,173 @@
+ import torch
+ import numpy as np
+ from PIL import Image
+ from transformers import AutoModel
+ from typing import Union, List
+ import io
+
+
+ class JinaClipEmbeddingService:
+     """
+     Jina CLIP v2 embedding service with Vietnamese language support.
+     Uses AutoModel with trust_remote_code.
+     """
+
+     def __init__(self, model_path: str = "jinaai/jina-clip-v2"):
+         """
+         Initialize Jina CLIP v2 model
+
+         Args:
+             model_path: Local model path or HuggingFace model name
+         """
+         print(f"Loading Jina CLIP v2 model from {model_path}...")
+
+         # Load the model with trust_remote_code
+         self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
+
+         # Switch to eval mode
+         self.model.eval()
+
+         # Use GPU if available
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         self.model.to(self.device)
+
+         print(f"✓ Loaded Jina CLIP v2 model on: {self.device}")
+
+     def encode_text(
+         self,
+         text: Union[str, List[str]],
+         truncate_dim: int = None,
+         normalize: bool = True
+     ) -> np.ndarray:
+         """
+         Encode text into vector embeddings (supports Vietnamese)
+
+         Args:
+             text: A text or list of texts (Vietnamese supported)
+             truncate_dim: Matryoshka dimension (64-1024, None = full 1024)
+             normalize: Whether to L2-normalize the embeddings
+
+         Returns:
+             numpy array of embeddings
+         """
+         if isinstance(text, str):
+             text = [text]
+
+         # Jina CLIP v2 encode_text method
+         # Tokenization is handled internally
+         embeddings = self.model.encode_text(
+             text,
+             truncate_dim=truncate_dim  # Optional: 64, 128, 256, 512, 1024
+         )
+
+         # Convert to numpy
+         if isinstance(embeddings, torch.Tensor):
+             embeddings = embeddings.cpu().detach().numpy()
+
+         # Normalize if requested
+         if normalize:
+             embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
+
+         return embeddings
+
+     def encode_image(
+         self,
+         image: Union[Image.Image, bytes, List, str],
+         truncate_dim: int = None,
+         normalize: bool = True
+     ) -> np.ndarray:
+         """
+         Encode image into vector embeddings
+
+         Args:
+             image: PIL Image, bytes, URL string, or list of images
+             truncate_dim: Matryoshka dimension (64-1024, None = full 1024)
+             normalize: Whether to L2-normalize the embeddings
+
+         Returns:
+             numpy array of embeddings
+         """
+         # Convert bytes to a PIL Image if needed
+         if isinstance(image, bytes):
+             image = Image.open(io.BytesIO(image)).convert('RGB')
+         elif isinstance(image, list):
+             processed_images = []
+             for img in image:
+                 if isinstance(img, bytes):
+                     processed_images.append(Image.open(io.BytesIO(img)).convert('RGB'))
+                 elif isinstance(img, str):
+                     # URL string - keep as is, Jina CLIP can handle URLs
+                     processed_images.append(img)
+                 else:
+                     processed_images.append(img)
+             image = processed_images
+         elif not isinstance(image, str):
+             # Single PIL Image - wrap in a list
+             image = [image]
+
+         # Jina CLIP v2 encode_image method
+         # Supports PIL Images, file paths, or URLs
+         embeddings = self.model.encode_image(
+             image,
+             truncate_dim=truncate_dim  # Optional: 64, 128, 256, 512, 1024
+         )
+
+         # Convert to numpy
+         if isinstance(embeddings, torch.Tensor):
+             embeddings = embeddings.cpu().detach().numpy()
+
+         # Normalize if requested
+         if normalize:
+             embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
+
+         return embeddings
+
+     def encode_multimodal(
+         self,
+         text: Union[str, List[str]] = None,
+         image: Union[Image.Image, bytes, List] = None,
+         truncate_dim: int = None,
+         normalize: bool = True
+     ) -> np.ndarray:
+         """
+         Encode both text and image and return a combined embedding
+
+         Args:
+             text: A text or list of texts (Vietnamese supported)
+             image: PIL Image, bytes, or list of images
+             truncate_dim: Matryoshka dimension (64-1024, None = full 1024)
+             normalize: Whether to L2-normalize the embeddings
+
+         Returns:
+             numpy array of embeddings
+         """
+         embeddings = []
+
+         if text is not None:
+             text_emb = self.encode_text(text, truncate_dim=truncate_dim, normalize=False)
+             embeddings.append(text_emb)
+
+         if image is not None:
+             image_emb = self.encode_image(image, truncate_dim=truncate_dim, normalize=False)
+             embeddings.append(image_emb)
+
+         # Combine embeddings (average)
+         if len(embeddings) == 2:
+             # Average of the text and image embeddings
+             combined = np.mean(embeddings, axis=0)
+         elif len(embeddings) == 1:
+             combined = embeddings[0]
+         else:
+             raise ValueError("At least one of text or image must be provided")
+
+         # Normalize if requested
+         if normalize:
+             combined = combined / np.linalg.norm(combined, axis=1, keepdims=True)
+
+         return combined
+
+     def get_embedding_dimension(self) -> int:
+         """
+         Return the embedding dimension (1024 for Jina CLIP v2)
+         """
+         return 1024
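The fusion step in `encode_multimodal` is just an element-wise average followed by row-wise L2 normalization. This small numpy sketch shows the same two operations on toy 3-dimensional vectors (real Jina CLIP v2 embeddings are 1024-dimensional):

```python
import numpy as np

# Toy stand-ins for one text embedding and one image embedding (shape 1x3).
text_emb = np.array([[3.0, 0.0, 0.0]])
image_emb = np.array([[0.0, 4.0, 0.0]])

# Average the two modalities, then L2-normalize each row - the same
# combination encode_multimodal performs on its 1024-d vectors.
combined = np.mean([text_emb, image_emb], axis=0)   # [[1.5, 2.0, 0.0]]
combined = combined / np.linalg.norm(combined, axis=1, keepdims=True)

print(combined)  # → [[0.6 0.8 0. ]]
```

Averaging unnormalized embeddings keeps the fused vector in the shared CLIP space, and the final normalization makes cosine similarity equivalent to a dot product at search time.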
main.py ADDED
@@ -0,0 +1,1285 @@
+ from fastapi import FastAPI, UploadFile, File, Form, HTTPException
+ from fastapi.responses import JSONResponse
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+ from typing import Optional, List, Dict
+ from PIL import Image
+ import io
+ import numpy as np
+ import os
+ from datetime import datetime
+ from pymongo import MongoClient
+ from huggingface_hub import InferenceClient
+
+ from embedding_service import JinaClipEmbeddingService
+ from qdrant_service import QdrantVectorService
+ from advanced_rag import AdvancedRAG
+ from pdf_parser import PDFIndexer
+ from multimodal_pdf_parser import MultimodalPDFIndexer
+
+ # Initialize FastAPI app
+ app = FastAPI(
+     title="Event Social Media Embeddings & ChatbotRAG API",
+     description="API for embeddings, search, and ChatbotRAG with Jina CLIP v2 + Qdrant + MongoDB + LLM",
+     version="2.0.0"
+ )
+
+ # CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Initialize services
+ print("Initializing services...")
+ embedding_service = JinaClipEmbeddingService(model_path="jinaai/jina-clip-v2")
+
+ collection_name = os.getenv("COLLECTION_NAME", "event_social_media")
+ qdrant_service = QdrantVectorService(
+     collection_name=collection_name,
+     vector_size=embedding_service.get_embedding_dimension()
+ )
+ print(f"✓ Qdrant collection: {collection_name}")
+
+ # MongoDB connection
+ mongodb_uri = os.getenv("MONGODB_URI", "mongodb+srv://truongtn7122003:[email protected]/")
+ mongo_client = MongoClient(mongodb_uri)
+ db = mongo_client[os.getenv("MONGODB_DB_NAME", "chatbot_rag")]
+ documents_collection = db["documents"]
+ chat_history_collection = db["chat_history"]
+ print("✓ MongoDB connected")
+
+ # Hugging Face token
+ hf_token = os.getenv("HUGGINGFACE_TOKEN")
+ if hf_token:
+     print("✓ Hugging Face token configured")
+
+ # Initialize Advanced RAG
+ advanced_rag = AdvancedRAG(
+     embedding_service=embedding_service,
+     qdrant_service=qdrant_service
+ )
+ print("✓ Advanced RAG pipeline initialized")
+
+ # Initialize PDF Indexer
+ pdf_indexer = PDFIndexer(
+     embedding_service=embedding_service,
+     qdrant_service=qdrant_service,
+     documents_collection=documents_collection
+ )
+ print("✓ PDF Indexer initialized")
+
+ # Initialize Multimodal PDF Indexer (for PDFs with images)
+ multimodal_pdf_indexer = MultimodalPDFIndexer(
+     embedding_service=embedding_service,
+     qdrant_service=qdrant_service,
+     documents_collection=documents_collection
+ )
+ print("✓ Multimodal PDF Indexer initialized")
+
+ print("✓ Services initialized successfully")
+
+
+ # Pydantic models for embeddings
+ class SearchRequest(BaseModel):
+     text: Optional[str] = None
+     limit: int = 10
+     score_threshold: Optional[float] = None
+     text_weight: float = 0.5
+     image_weight: float = 0.5
+
+
+ class SearchResponse(BaseModel):
+     id: str
+     confidence: float
+     metadata: dict
+
+
+ class IndexResponse(BaseModel):
+     success: bool
+     id: str
+     message: str
+
+
+ # Pydantic models for ChatbotRAG
+ class ChatRequest(BaseModel):
+     message: str
+     use_rag: bool = True
+     top_k: int = 3
+     system_message: Optional[str] = "You are a helpful AI assistant."
+     max_tokens: int = 512
+     temperature: float = 0.7
+     top_p: float = 0.95
+     hf_token: Optional[str] = None
+     # Advanced RAG options
+     use_advanced_rag: bool = True
+     use_query_expansion: bool = True
+     use_reranking: bool = True
+     use_compression: bool = True
+     score_threshold: float = 0.5
+
+
+ class ChatResponse(BaseModel):
+     response: str
+     context_used: List[Dict]
+     timestamp: str
+     rag_stats: Optional[Dict] = None  # Stats from the advanced RAG pipeline
+
+
+ class AddDocumentRequest(BaseModel):
+     text: str
+     metadata: Optional[Dict] = None
+
+
+ class AddDocumentResponse(BaseModel):
+     success: bool
+     doc_id: str
+     message: str
+
+
+ class UploadPDFResponse(BaseModel):
+     success: bool
+     document_id: str
+     filename: str
+     chunks_indexed: int
+     message: str
+
+
+ @app.get("/")
+ async def root():
+     """Health check endpoint with comprehensive API documentation"""
+     return {
+         "status": "running",
+         "service": "ChatbotRAG API - Advanced RAG with Multimodal Support",
+         "version": "3.0.0",
+         "vector_db": "Qdrant",
+         "document_db": "MongoDB",
+         "features": {
+             "multiple_inputs": "Index up to 10 texts + 10 images per request",
+             "advanced_rag": "Query expansion, reranking, contextual compression",
+             "pdf_support": "Upload PDFs and chat about their content",
+             "multimodal_pdf": "PDFs with text and image URLs - perfect for user guides",
+             "chat_history": "Track conversation history",
+             "hybrid_search": "Text + image search with Jina CLIP v2"
+         },
+         "endpoints": {
+             "indexing": {
+                 "POST /index": {
+                     "description": "Index multiple texts and images (NEW: up to 10 each)",
+                     "content_type": "multipart/form-data",
+                     "body": {
+                         "id": "string (required) - Document ID",
+                         "texts": "List[string] (optional) - Up to 10 texts",
+                         "images": "List[UploadFile] (optional) - Up to 10 images"
+                     },
+                     "example": "curl -X POST '/index' -F 'id=doc1' -F 'texts=Text 1' -F 'texts=Text 2' -F '[email protected]'",
+                     "response": {
+                         "success": True,
+                         "id": "doc1",
+                         "message": "Indexed successfully with 2 texts and 1 images"
+                     }
+                 },
+                 "POST /documents": {
+                     "description": "Add text document to knowledge base",
+                     "content_type": "application/json",
+                     "body": {
+                         "text": "string (required) - Document content",
+                         "metadata": "object (optional) - Additional metadata"
+                     },
+                     "example": {
+                         "text": "How to create event: Click 'Create Event' button...",
+                         "metadata": {"category": "tutorial", "source": "user_guide"}
+                     }
+                 },
+                 "POST /upload-pdf": {
+                     "description": "Upload PDF file (text only)",
+                     "content_type": "multipart/form-data",
+                     "body": {
+                         "file": "UploadFile (required) - PDF file",
+                         "title": "string (optional) - Document title",
+                         "category": "string (optional) - Category",
+                         "description": "string (optional) - Description"
+                     },
+                     "example": "curl -X POST '/upload-pdf' -F '[email protected]' -F 'title=User Guide'"
+                 },
+                 "POST /upload-pdf-multimodal": {
+                     "description": "Upload PDF with text and image URLs (RECOMMENDED for user guides)",
+                     "content_type": "multipart/form-data",
+                     "features": [
+                         "Extracts text from PDF",
+                         "Detects image URLs (http://, https://)",
+                         "Supports markdown: ![alt](url)",
+                         "Supports HTML: <img src='url'>",
+                         "Links images to text chunks",
+                         "Returns images with context in chat"
+                     ],
+                     "body": {
+                         "file": "UploadFile (required) - PDF file with image URLs",
+                         "title": "string (optional) - Document title",
+                         "category": "string (optional) - e.g. 'user_guide', 'tutorial'",
+                         "description": "string (optional)"
+                     },
+                     "example": "curl -X POST '/upload-pdf-multimodal' -F 'file=@guide_with_images.pdf' -F 'category=user_guide'",
+                     "response": {
+                         "success": True,
+                         "document_id": "pdf_multimodal_20251029_150000",
+                         "chunks_indexed": 25,
+                         "message": "PDF indexed with 25 chunks and 15 images"
+                     },
+                     "use_case": "Perfect for user guides with screenshots, tutorials with diagrams"
+                 }
+             },
+             "search": {
+                 "POST /search": {
+                     "description": "Hybrid search with text and/or image",
+                     "body": {
+                         "text": "string (optional) - Query text",
+                         "image": "UploadFile (optional) - Query image",
+                         "limit": "int (default: 10)",
+                         "score_threshold": "float (optional, 0-1)",
+                         "text_weight": "float (default: 0.5)",
+                         "image_weight": "float (default: 0.5)"
+                     }
+                 },
+                 "POST /search/text": {
+                     "description": "Text-only search",
+                     "body": {"text": "string", "limit": "int", "score_threshold": "float"}
+                 },
+                 "POST /search/image": {
+                     "description": "Image-only search",
+                     "body": {"image": "UploadFile", "limit": "int", "score_threshold": "float"}
+                 },
+                 "POST /rag/search": {
+                     "description": "Search in RAG knowledge base",
+                     "body": {"query": "string", "top_k": "int (default: 5)", "score_threshold": "float (default: 0.5)"}
+                 }
+             },
+             "chat": {
+                 "POST /chat": {
+                     "description": "Chat with Advanced RAG (query expansion + reranking + compression)",
+                     "content_type": "application/json",
+                     "body": {
+                         "message": "string (required) - User question",
+                         "use_rag": "bool (default: true) - Enable RAG retrieval",
+                         "use_advanced_rag": "bool (default: true) - Use advanced RAG pipeline (RECOMMENDED)",
+                         "use_query_expansion": "bool (default: true) - Expand query with variations",
+                         "use_reranking": "bool (default: true) - Rerank results for accuracy",
+                         "use_compression": "bool (default: true) - Compress context to relevant parts",
+                         "top_k": "int (default: 3) - Number of documents to retrieve",
+                         "score_threshold": "float (default: 0.5) - Min relevance score (0-1)",
+                         "max_tokens": "int (default: 512) - Max response tokens",
+                         "temperature": "float (default: 0.7) - Creativity (0-1)",
+                         "hf_token": "string (optional) - Hugging Face token"
+                     },
+                     "response": {
+                         "response": "string - AI answer",
+                         "context_used": "array - Retrieved documents with metadata",
+                         "timestamp": "string",
+                         "rag_stats": "object - RAG pipeline statistics (query variants, retrieval counts)"
+                     },
+                     "example_advanced": {
+                         "message": "Làm sao để upload PDF có hình ảnh?",
+                         "use_advanced_rag": True,
+                         "use_reranking": True,
+                         "top_k": 5,
+                         "score_threshold": 0.5
+                     },
+                     "example_response_with_images": {
+                         "response": "Để upload PDF có hình ảnh, sử dụng endpoint /upload-pdf-multimodal...",
+                         "context_used": [
+                             {
+                                 "id": "pdf_multimodal_...._p2_c1",
+                                 "confidence": 0.89,
+                                 "metadata": {
+                                     "text": "Bước 1: Chuẩn bị PDF với image URLs...",
+                                     "has_images": True,
+                                     "image_urls": [
+                                         "https://example.com/screenshot1.png",
+                                         "https://example.com/diagram.jpg"
+                                     ],
+                                     "num_images": 2,
+                                     "page": 2
+                                 }
+                             }
+                         ],
+                         "rag_stats": {
+                             "original_query": "Làm sao để upload PDF có hình ảnh?",
+                             "expanded_queries": ["upload PDF hình ảnh", "PDF có ảnh"],
+                             "initial_results": 10,
+                             "after_rerank": 5,
+                             "after_compression": 5
+                         }
+                     },
+                     "notes": [
+                         "Advanced RAG significantly improves answer quality",
+                         "When multimodal PDF is used, images are returned in metadata",
+                         "Requires HUGGINGFACE_TOKEN for actual LLM generation"
+                     ]
+                 },
+                 "GET /history": {
+                     "description": "Get chat history",
+                     "query_params": {"limit": "int (default: 10)", "skip": "int (default: 0)"},
+                     "response": {"history": "array", "total": "int"}
+                 }
+             },
+             "management": {
+                 "GET /documents/pdf": {
+                     "description": "List all PDF documents",
+                     "response": {"documents": "array", "total": "int"}
+                 },
+                 "DELETE /documents/pdf/{document_id}": {
+                     "description": "Delete PDF and all its chunks",
+                     "response": {"success": "bool", "message": "string"}
+                 },
+                 "GET /document/{doc_id}": {
+                     "description": "Get document by ID",
+                     "response": {"success": "bool", "data": "object"}
+                 },
+                 "DELETE /delete/{doc_id}": {
+                     "description": "Delete document by ID",
+                     "response": {"success": "bool", "message": "string"}
+                 },
+                 "GET /stats": {
+                     "description": "Get Qdrant collection statistics",
+                     "response": {"vectors_count": "int", "segments": "int", "...": "..."}
+                 }
+             }
+         },
+         "quick_start": {
+             "1_upload_multimodal_pdf": "curl -X POST '/upload-pdf-multimodal' -F 'file=@user_guide.pdf' -F 'title=Guide'",
+             "2_verify_upload": "curl '/documents/pdf'",
+             "3_chat_with_rag": "curl -X POST '/chat' -H 'Content-Type: application/json' -d '{\"message\": \"How to...?\", \"use_advanced_rag\": true}'",
+             "4_see_images_in_context": "response['context_used'][0]['metadata']['image_urls']"
+         },
+         "use_cases": {
+             "user_guide_with_screenshots": {
+                 "endpoint": "/upload-pdf-multimodal",
+                 "description": "PDFs with text instructions + image URLs for visual guidance",
+                 "benefits": ["Images linked to text chunks", "Chatbot returns relevant screenshots", "Perfect for step-by-step guides"]
+             },
+             "simple_text_docs": {
+                 "endpoint": "/upload-pdf",
+                 "description": "Simple PDFs with text only (FAQ, policies, etc.)"
+             },
+             "social_media_posts": {
+                 "endpoint": "/index",
+                 "description": "Index multiple posts with texts (up to 10) and images (up to 10)"
+             },
+             "complex_queries": {
+                 "endpoint": "/chat",
+                 "description": "Use advanced RAG for better accuracy on complex questions",
374
+ "settings": {"use_advanced_rag": True, "use_reranking": True, "use_compression": True}
375
+ }
376
+ },
377
+ "best_practices": {
378
+ "pdf_format": [
379
+ "Include image URLs in text (http://, https://)",
380
+ "Use markdown format: ![alt](url) or HTML: <img src='url'>",
381
+ "Clear structure with headings and sections",
382
+ "Link images close to their related text"
383
+ ],
384
+ "chat_settings": {
385
+ "for_accuracy": {"temperature": 0.3, "use_advanced_rag": True, "use_reranking": True},
386
+ "for_creativity": {"temperature": 0.8, "use_advanced_rag": False},
387
+ "for_factual_answers": {"temperature": 0.3, "use_compression": True, "score_threshold": 0.6}
388
+ },
389
+ "retrieval_tuning": {
390
+ "not_finding_info": "Lower score_threshold to 0.3-0.4, increase top_k to 7-10",
391
+ "too_much_context": "Increase score_threshold to 0.6-0.7, decrease top_k to 3-5",
392
+ "slow_responses": "Disable compression, use basic RAG, decrease top_k"
393
+ }
394
+ },
395
+ "links": {
396
+ "docs": "http://localhost:8000/docs",
397
+ "redoc": "http://localhost:8000/redoc",
398
+ "openapi": "http://localhost:8000/openapi.json",
399
+ "guides": {
400
+ "multimodal_pdf": "See MULTIMODAL_PDF_GUIDE.md",
401
+ "advanced_rag": "See ADVANCED_RAG_GUIDE.md",
402
+ "pdf_general": "See PDF_RAG_GUIDE.md",
403
+ "quick_start": "See QUICK_START_PDF.md"
404
+ }
405
+ },
406
+ "system_info": {
407
+ "embedding_model": "Jina CLIP v2 (multimodal)",
408
+ "vector_db": "Qdrant with HNSW index",
409
+ "document_db": "MongoDB",
410
+ "rag_pipeline": "Advanced RAG with query expansion, reranking, compression",
411
+ "pdf_parser": "pypdfium2 with URL extraction",
412
+ "max_inputs": "10 texts + 10 images per /index request"
413
+ }
414
+ }
415
+
416
+ @app.post("/index", response_model=IndexResponse)
417
+ async def index_data(
418
+ id: str = Form(...),
419
+ texts: Optional[List[str]] = Form(None),
420
+ images: Optional[List[UploadFile]] = File(None)
421
+ ):
422
+ """
423
+ Index data into the vector database (supports multiple texts and images)
+
+ Body:
+ - id: Document ID (event ID, post ID, etc.)
+ - texts: List of text contents (Vietnamese supported) - up to 10 texts
+ - images: List of image files (optional) - up to 10 images
429
+
430
+ Returns:
431
+ - success: True/False
432
+ - id: Document ID
433
+ - message: Status message
434
+ """
435
+ try:
436
+ # Validation
437
+ if texts is None and images is None:
438
+ raise HTTPException(status_code=400, detail="At least one of texts or images must be provided")
439
+
440
+ if texts and len(texts) > 10:
+ raise HTTPException(status_code=400, detail="Maximum 10 texts")
+
+ if images and len(images) > 10:
+ raise HTTPException(status_code=400, detail="Maximum 10 images")
445
+
446
+ # Prepare embeddings
447
+ text_embeddings = []
448
+ image_embeddings = []
449
+
450
+ # Encode multiple texts (Vietnamese supported)
451
+ if texts:
452
+ for text in texts:
453
+ if text and text.strip():
454
+ text_emb = embedding_service.encode_text(text)
455
+ text_embeddings.append(text_emb)
456
+
457
+ # Encode multiple images
458
+ if images:
459
+ for image in images:
460
+ if image.filename: # Check if image is provided
461
+ image_bytes = await image.read()
462
+ pil_image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
463
+ image_emb = embedding_service.encode_image(pil_image)
464
+ image_embeddings.append(image_emb)
465
+
466
+ # Combine embeddings
467
+ all_embeddings = []
468
+
469
+ if text_embeddings:
470
+ # Average all text embeddings
471
+ avg_text_embedding = np.mean(text_embeddings, axis=0)
472
+ all_embeddings.append(avg_text_embedding)
473
+
474
+ if image_embeddings:
475
+ # Average all image embeddings
476
+ avg_image_embedding = np.mean(image_embeddings, axis=0)
477
+ all_embeddings.append(avg_image_embedding)
478
+
479
+ if not all_embeddings:
480
+ raise HTTPException(status_code=400, detail="No embeddings could be created from the provided texts or images")
481
+
482
+ # Final combined embedding
483
+ combined_embedding = np.mean(all_embeddings, axis=0)
484
+
485
+ # Normalize
486
+ combined_embedding = combined_embedding / np.linalg.norm(combined_embedding, axis=-1, keepdims=True)  # axis=-1 handles both 1-D and (1, dim) embeddings
487
+
488
+ # Index into Qdrant
489
+ metadata = {
490
+ "texts": texts if texts else [],
491
+ "text_count": len(texts) if texts else 0,
492
+ "image_count": len(images) if images else 0,
493
+ "image_filenames": [img.filename for img in images] if images else []
494
+ }
495
+
496
+ result = qdrant_service.index_data(
497
+ doc_id=id,
498
+ embedding=combined_embedding,
499
+ metadata=metadata
500
+ )
501
+
502
+ return IndexResponse(
503
+ success=True,
504
+ id=result["original_id"],  # return the MongoDB ObjectId
+ message=f"Successfully indexed document {result['original_id']} with {len(texts) if texts else 0} texts and {len(images) if images else 0} images (Qdrant UUID: {result['qdrant_id']})"
506
+ )
507
+
508
+ except HTTPException:
509
+ raise
510
+ except Exception as e:
511
+ raise HTTPException(status_code=500, detail=f"Indexing error: {str(e)}")
512
+
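The embedding-combination step above (average each modality, then average across modalities and L2-normalize) can be sketched standalone. This is a minimal illustration with NumPy only; it assumes each encoder returns a 1-D float vector, which may differ from the actual `embedding_service` output shape.

```python
import numpy as np

def combine_embeddings(text_embs, image_embs):
    """Average each modality, then average modalities and L2-normalize."""
    parts = []
    if text_embs:
        parts.append(np.mean(text_embs, axis=0))   # mean of all text vectors
    if image_embs:
        parts.append(np.mean(image_embs, axis=0))  # mean of all image vectors
    if not parts:
        raise ValueError("need at least one embedding")
    combined = np.mean(parts, axis=0)              # fuse modalities
    return combined / np.linalg.norm(combined)     # unit-normalize for cosine search

# Two toy 4-dim text embeddings and one image embedding
combined = combine_embeddings(
    [np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])],
    [np.array([0.0, 0.0, 1.0, 0.0])],
)
```

Normalizing the fused vector keeps confidence scores comparable across text-only, image-only, and mixed documents.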
513
+
514
+ @app.post("/search", response_model=List[SearchResponse])
515
+ async def search(
516
+ text: Optional[str] = Form(None),
517
+ image: Optional[UploadFile] = File(None),
518
+ limit: int = Form(10),
519
+ score_threshold: Optional[float] = Form(None),
520
+ text_weight: float = Form(0.5),
521
+ image_weight: float = Form(0.5)
522
+ ):
523
+ """
524
+ Search for similar documents by text and/or image
+
+ Body:
+ - text: Query text (Vietnamese supported)
+ - image: Query image (optional)
+ - limit: Number of results (default: 10)
+ - score_threshold: Minimum confidence score (0-1)
+ - text_weight: Weight for text search (default: 0.5)
+ - image_weight: Weight for image search (default: 0.5)
+
+ Returns:
+ - List of results with id, confidence, and metadata
536
+ """
537
+ try:
538
+ # Prepare query embeddings
539
+ text_embedding = None
540
+ image_embedding = None
541
+
542
+ # Encode text query
543
+ if text and text.strip():
544
+ text_embedding = embedding_service.encode_text(text)
545
+
546
+ # Encode image query
547
+ if image:
548
+ image_bytes = await image.read()
549
+ pil_image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
550
+ image_embedding = embedding_service.encode_image(pil_image)
551
+
552
+ # Validate input
553
+ if text_embedding is None and image_embedding is None:
554
+ raise HTTPException(status_code=400, detail="At least one of text or image must be provided for search")
555
+
556
+ # Hybrid search với Qdrant
557
+ results = qdrant_service.hybrid_search(
558
+ text_embedding=text_embedding,
559
+ image_embedding=image_embedding,
560
+ text_weight=text_weight,
561
+ image_weight=image_weight,
562
+ limit=limit,
563
+ score_threshold=score_threshold,
564
+ ef=256 # High accuracy search
565
+ )
566
+
567
+ # Format response
568
+ return [
569
+ SearchResponse(
570
+ id=result["id"],
571
+ confidence=result["confidence"],
572
+ metadata=result["metadata"]
573
+ )
574
+ for result in results
575
+ ]
576
+
577
+ except HTTPException:
+ raise
+ except Exception as e:
+ raise HTTPException(status_code=500, detail=f"Search error: {str(e)}")
579
+
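A client call to `/search` can be assembled as below. This is a sketch: the helper function and its defaults are illustrative, and the actual `requests.post` call is shown commented out since it needs a running server; the field names match the Form parameters of the endpoint above.

```python
def build_search_request(text=None, image_path=None, limit=10,
                         text_weight=0.5, image_weight=0.5):
    """Build the multipart form payload for POST /search."""
    data = {
        "limit": str(limit),
        "text_weight": str(text_weight),
        "image_weight": str(image_weight),
    }
    if text:
        data["text"] = text
    files = {}
    if image_path:
        files["image"] = open(image_path, "rb")
    return data, files

# Text-weighted hybrid query (no image attached)
data, files = build_search_request(
    text="music event in Hanoi", text_weight=0.7, image_weight=0.3
)
# import requests
# resp = requests.post("http://localhost:8000/search", data=data, files=files)
```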
580
+
581
+ @app.post("/search/text", response_model=List[SearchResponse])
582
+ async def search_by_text(
583
+ text: str = Form(...),
584
+ limit: int = Form(10),
585
+ score_threshold: Optional[float] = Form(None)
586
+ ):
587
+ """
588
+ Search by text only (Vietnamese supported)
+
+ Body:
+ - text: Query text
+ - limit: Number of results
+ - score_threshold: Minimum confidence score
594
+
595
+ Returns:
596
+ - List of results
597
+ """
598
+ try:
599
+ # Encode text
600
+ text_embedding = embedding_service.encode_text(text)
601
+
602
+ # Search
603
+ results = qdrant_service.search(
604
+ query_embedding=text_embedding,
605
+ limit=limit,
606
+ score_threshold=score_threshold,
607
+ ef=256
608
+ )
609
+
610
+ return [
611
+ SearchResponse(
612
+ id=result["id"],
613
+ confidence=result["confidence"],
614
+ metadata=result["metadata"]
615
+ )
616
+ for result in results
617
+ ]
618
+
619
+ except Exception as e:
620
+ raise HTTPException(status_code=500, detail=f"Search error: {str(e)}")
621
+
622
+
623
+ @app.post("/search/image", response_model=List[SearchResponse])
624
+ async def search_by_image(
625
+ image: UploadFile = File(...),
626
+ limit: int = Form(10),
627
+ score_threshold: Optional[float] = Form(None)
628
+ ):
629
+ """
630
+ Search by image only
+
+ Body:
+ - image: Query image
+ - limit: Number of results
635
+ - score_threshold: Minimum confidence score
636
+
637
+ Returns:
638
+ - List of results
639
+ """
640
+ try:
641
+ # Encode image
642
+ image_bytes = await image.read()
643
+ pil_image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
644
+ image_embedding = embedding_service.encode_image(pil_image)
645
+
646
+ # Search
647
+ results = qdrant_service.search(
648
+ query_embedding=image_embedding,
649
+ limit=limit,
650
+ score_threshold=score_threshold,
651
+ ef=256
652
+ )
653
+
654
+ return [
655
+ SearchResponse(
656
+ id=result["id"],
657
+ confidence=result["confidence"],
658
+ metadata=result["metadata"]
659
+ )
660
+ for result in results
661
+ ]
662
+
663
+ except Exception as e:
664
+ raise HTTPException(status_code=500, detail=f"Search error: {str(e)}")
665
+
666
+
667
+ @app.delete("/delete/{doc_id}")
668
+ async def delete_document(doc_id: str):
669
+ """
670
+ Delete document by ID (MongoDB ObjectId or UUID)
671
+
672
+ Args:
673
+ - doc_id: Document ID to delete
674
+
675
+ Returns:
676
+ - Success message
677
+ """
678
+ try:
679
+ qdrant_service.delete_by_id(doc_id)
680
+ return {"success": True, "message": f"Deleted document {doc_id}"}
+ except Exception as e:
+ raise HTTPException(status_code=500, detail=f"Delete error: {str(e)}")
683
+
684
+
685
+ @app.get("/document/{doc_id}")
686
+ async def get_document(doc_id: str):
687
+ """
688
+ Get document by ID (MongoDB ObjectId or UUID)
689
+
690
+ Args:
691
+ - doc_id: Document ID (MongoDB ObjectId)
692
+
693
+ Returns:
694
+ - Document data
695
+ """
696
+ try:
697
+ doc = qdrant_service.get_by_id(doc_id)
698
+ if doc:
699
+ return {
700
+ "success": True,
701
+ "data": doc
702
+ }
703
+ raise HTTPException(status_code=404, detail=f"Document {doc_id} not found")
704
+ except HTTPException:
705
+ raise
706
+ except Exception as e:
707
+ raise HTTPException(status_code=500, detail=f"Error getting document: {str(e)}")
708
+
709
+
710
+ @app.get("/stats")
711
+ async def get_stats():
712
+ """
713
+ Get collection statistics
714
+
715
+ Returns:
716
+ - Collection statistics
717
+ """
718
+ try:
719
+ info = qdrant_service.get_collection_info()
720
+ return info
721
+ except Exception as e:
722
+ raise HTTPException(status_code=500, detail=f"Error getting stats: {str(e)}")
723
+
724
+
725
+ # ============================================
726
+ # ChatbotRAG Endpoints
727
+ # ============================================
728
+
729
+ @app.post("/chat", response_model=ChatResponse)
730
+ async def chat(request: ChatRequest):
731
+ """
732
+ Chat endpoint with Advanced RAG
733
+
734
+ Body:
735
+ - message: User message
736
+ - use_rag: Enable RAG retrieval (default: true)
737
+ - top_k: Number of documents to retrieve (default: 3)
738
+ - system_message: System prompt (optional)
739
+ - max_tokens: Max tokens for response (default: 512)
740
+ - temperature: Temperature for generation (default: 0.7)
741
+ - hf_token: Hugging Face token (optional; falls back to the HUGGINGFACE_TOKEN env var if not provided)
742
+ - use_advanced_rag: Use advanced RAG pipeline (default: true)
743
+ - use_query_expansion: Enable query expansion (default: true)
744
+ - use_reranking: Enable reranking (default: true)
745
+ - use_compression: Enable context compression (default: true)
746
+ - score_threshold: Minimum relevance score (default: 0.5)
747
+
748
+ Returns:
749
+ - response: Generated response
750
+ - context_used: Retrieved context documents
751
+ - timestamp: Response timestamp
752
+ - rag_stats: Statistics from RAG pipeline
753
+ """
754
+ try:
755
+ # Retrieve context if RAG enabled
756
+ context_used = []
757
+ rag_stats = None
758
+
759
+ if request.use_rag:
760
+ if request.use_advanced_rag:
761
+ # Use Advanced RAG Pipeline
762
+ documents, stats = advanced_rag.hybrid_rag_pipeline(
763
+ query=request.message,
764
+ top_k=request.top_k,
765
+ score_threshold=request.score_threshold,
766
+ use_reranking=request.use_reranking,
767
+ use_compression=request.use_compression,
768
+ max_context_tokens=500
769
+ )
770
+
771
+ # Convert to dict format for compatibility
772
+ context_used = [
773
+ {
774
+ "id": doc.id,
775
+ "confidence": doc.confidence,
776
+ "metadata": doc.metadata
777
+ }
778
+ for doc in documents
779
+ ]
780
+ rag_stats = stats
781
+
782
+ # Format context using advanced RAG formatter
783
+ context_text = advanced_rag.format_context_for_llm(documents)
784
+
785
+ else:
786
+ # Use basic RAG (original implementation)
787
+ query_embedding = embedding_service.encode_text(request.message)
788
+
789
+ results = qdrant_service.search(
790
+ query_embedding=query_embedding,
791
+ limit=request.top_k,
792
+ score_threshold=request.score_threshold
793
+ )
794
+ context_used = results
795
+
796
+ # Build context text (basic format)
797
+ context_text = "\n\nRelevant Context:\n"
798
+ for i, doc in enumerate(context_used, 1):
799
+ doc_text = doc["metadata"].get("text", "")
800
+ confidence = doc["confidence"]
801
+ context_text += f"\n[{i}] (Confidence: {confidence:.2f})\n{doc_text}\n"
802
+
803
+ # Build system message with context
804
+ if request.use_rag and context_used:
805
+ if request.use_advanced_rag:
806
+ # Use advanced prompt builder
807
+ system_message = advanced_rag.build_rag_prompt(
808
+ query=request.message,
809
+ context=context_text,
810
+ system_message=request.system_message
811
+ )
812
+ else:
813
+ # Basic prompt
814
+ system_message = f"{request.system_message}\n{context_text}\n\nPlease use the above context to answer the user's question when relevant."
815
+ else:
816
+ system_message = request.system_message
817
+
818
+ # Use token from request or fallback to env
819
+ token = request.hf_token or hf_token
820
+ # Generate response
821
+ if not token:
822
+ response = f"""[LLM Response Placeholder]
823
+
824
+ Context retrieved: {len(context_used)} documents
825
+ User question: {request.message}
826
+
827
+ To enable actual LLM generation:
828
+ 1. Set HUGGINGFACE_TOKEN environment variable, OR
829
+ 2. Pass hf_token in request body
830
+
831
+ Example:
832
+ {{
833
+ "message": "Your question",
834
+ "hf_token": "hf_xxxxxxxxxxxxx"
835
+ }}
836
+ """
837
+ else:
838
+ try:
839
+ client = InferenceClient(
840
+ token=token,  # use the request-provided token when given, else the env fallback
841
+ model="openai/gpt-oss-20b"
842
+ )
843
+
844
+ # Build messages
845
+ messages = [
846
+ {"role": "system", "content": system_message},
847
+ {"role": "user", "content": request.message}
848
+ ]
849
+
850
+ # Generate response
851
+ response = ""
852
+ for msg in client.chat_completion(
853
+ messages,
854
+ max_tokens=request.max_tokens,
855
+ stream=True,
856
+ temperature=request.temperature,
857
+ top_p=request.top_p,
858
+ ):
859
+ choices = msg.choices
860
+ if len(choices) and choices[0].delta.content:
861
+ response += choices[0].delta.content
862
+
863
+ except Exception as e:
864
+ response = f"Error generating response with LLM: {str(e)}\n\nContext was retrieved successfully, but LLM generation failed."
865
+
866
+ # Save to history
867
+ chat_data = {
868
+ "user_message": request.message,
869
+ "assistant_response": response,
870
+ "context_used": context_used,
871
+ "timestamp": datetime.utcnow()
872
+ }
873
+ chat_history_collection.insert_one(chat_data)
874
+
875
+ return ChatResponse(
876
+ response=response,
877
+ context_used=context_used,
878
+ timestamp=datetime.utcnow().isoformat(),
879
+ rag_stats=rag_stats
880
+ )
881
+
882
+ except Exception as e:
883
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
884
+
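A full advanced-RAG chat request body can be assembled like this. The flag names and defaults follow the ChatRequest fields documented in the endpoint above; the helper function itself is illustrative, and the network call is left commented out.

```python
def build_chat_payload(message, **overrides):
    """Assemble a /chat request body with advanced-RAG defaults."""
    payload = {
        "message": message,
        "use_rag": True,
        "use_advanced_rag": True,
        "use_query_expansion": True,
        "use_reranking": True,
        "use_compression": True,
        "top_k": 3,
        "score_threshold": 0.5,
        "max_tokens": 512,
        "temperature": 0.7,
    }
    payload.update(overrides)  # per-request tuning, e.g. lower temperature for factual answers
    return payload

body = build_chat_payload("How do I upload a PDF with images?", top_k=5, temperature=0.3)
# import requests
# resp = requests.post("http://localhost:8000/chat", json=body)
```

Lower `temperature` plus reranking is the combination the best-practices section recommends for accuracy-sensitive queries.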
885
+
886
+ @app.post("/documents", response_model=AddDocumentResponse)
887
+ async def add_document(request: AddDocumentRequest):
888
+ """
889
+ Add document to knowledge base
890
+
891
+ Body:
892
+ - text: Document text
893
+ - metadata: Additional metadata (optional)
894
+
895
+ Returns:
896
+ - success: True/False
897
+ - doc_id: MongoDB document ID
898
+ - message: Status message
899
+ """
900
+ try:
901
+ # Save to MongoDB
902
+ doc_data = {
903
+ "text": request.text,
904
+ "metadata": request.metadata or {},
905
+ "created_at": datetime.utcnow()
906
+ }
907
+ result = documents_collection.insert_one(doc_data)
908
+ doc_id = str(result.inserted_id)
909
+
910
+ # Generate embedding
911
+ embedding = embedding_service.encode_text(request.text)
912
+
913
+ # Index to Qdrant
914
+ qdrant_service.index_data(
915
+ doc_id=doc_id,
916
+ embedding=embedding,
917
+ metadata={
918
+ "text": request.text,
919
+ "source": "api",
920
+ **(request.metadata or {})
921
+ }
922
+ )
923
+
924
+ return AddDocumentResponse(
925
+ success=True,
926
+ doc_id=doc_id,
927
+ message=f"Document added successfully with ID: {doc_id}"
928
+ )
929
+
930
+ except Exception as e:
931
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
932
+
933
+
934
+ @app.post("/rag/search", response_model=List[SearchResponse])
935
+ async def rag_search(
936
+ query: str = Form(...),
937
+ top_k: int = Form(5),
938
+ score_threshold: Optional[float] = Form(0.5)
939
+ ):
940
+ """
941
+ Search in knowledge base
942
+
943
+ Body:
944
+ - query: Search query
945
+ - top_k: Number of results (default: 5)
946
+ - score_threshold: Minimum score (default: 0.5)
947
+
948
+ Returns:
949
+ - results: List of matching documents
950
+ """
951
+ try:
952
+ # Generate query embedding
953
+ query_embedding = embedding_service.encode_text(query)
954
+
955
+ # Search in Qdrant
956
+ results = qdrant_service.search(
957
+ query_embedding=query_embedding,
958
+ limit=top_k,
959
+ score_threshold=score_threshold
960
+ )
961
+
962
+ return [
963
+ SearchResponse(
964
+ id=result["id"],
965
+ confidence=result["confidence"],
966
+ metadata=result["metadata"]
967
+ )
968
+ for result in results
969
+ ]
970
+
971
+ except Exception as e:
972
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
973
+
974
+
975
+ @app.get("/history")
976
+ async def get_history(limit: int = 10, skip: int = 0):
977
+ """
978
+ Get chat history
979
+
980
+ Query params:
981
+ - limit: Number of messages to return (default: 10)
982
+ - skip: Number of messages to skip (default: 0)
983
+
984
+ Returns:
985
+ - history: List of chat messages
986
+ """
987
+ try:
988
+ history = list(
989
+ chat_history_collection
990
+ .find({}, {"_id": 0})
991
+ .sort("timestamp", -1)
992
+ .skip(skip)
993
+ .limit(limit)
994
+ )
995
+
996
+ # Convert datetime to string
997
+ for msg in history:
998
+ if "timestamp" in msg:
999
+ msg["timestamp"] = msg["timestamp"].isoformat()
1000
+
1001
+ return {
1002
+ "history": history,
1003
+ "total": chat_history_collection.count_documents({})
1004
+ }
1005
+
1006
+ except Exception as e:
1007
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
1008
+
1009
+
1010
+ @app.delete("/documents/{doc_id}")
1011
+ async def delete_document_from_kb(doc_id: str):
1012
+ """
1013
+ Delete document from knowledge base
1014
+
1015
+ Args:
1016
+ - doc_id: Document ID (MongoDB ObjectId)
1017
+
1018
+ Returns:
1019
+ - success: True/False
1020
+ - message: Status message
1021
+ """
1022
+ try:
1023
+ # Delete from MongoDB
1024
+ from bson import ObjectId
+ result = documents_collection.delete_one({"_id": ObjectId(doc_id)})  # _id is stored as ObjectId, not str
1025
+
1026
+ # Delete from Qdrant
1027
+ if result.deleted_count > 0:
1028
+ qdrant_service.delete_by_id(doc_id)
1029
+ return {"success": True, "message": f"Document {doc_id} deleted from knowledge base"}
1030
+ else:
1031
+ raise HTTPException(status_code=404, detail=f"Document {doc_id} not found")
1032
+
1033
+ except HTTPException:
1034
+ raise
1035
+ except Exception as e:
1036
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
1037
+
1038
+
1039
+ @app.post("/upload-pdf", response_model=UploadPDFResponse)
1040
+ async def upload_pdf(
1041
+ file: UploadFile = File(...),
1042
+ document_id: Optional[str] = Form(None),
1043
+ title: Optional[str] = Form(None),
1044
+ description: Optional[str] = Form(None),
1045
+ category: Optional[str] = Form(None)
1046
+ ):
1047
+ """
1048
+ Upload and index PDF file into knowledge base
1049
+
1050
+ Body (multipart/form-data):
1051
+ - file: PDF file (required)
1052
+ - document_id: Custom document ID (optional, auto-generated if not provided)
1053
+ - title: Document title (optional)
1054
+ - description: Document description (optional)
1055
+ - category: Document category (optional, e.g., "user_guide", "faq")
1056
+
1057
+ Returns:
1058
+ - success: True/False
1059
+ - document_id: Document ID
1060
+ - filename: Original filename
1061
+ - chunks_indexed: Number of chunks created
1062
+ - message: Status message
1063
+
1064
+ Example:
1065
+ ```bash
1066
+ curl -X POST "http://localhost:8000/upload-pdf" \
1067
+ -F "file=@user_guide.pdf" \
1068
+ -F "title=Hướng dẫn sử dụng ChatbotRAG" \
1069
+ -F "category=user_guide"
1070
+ ```
1071
+ """
1072
+ try:
1073
+ # Validate file type
1074
+ if not file.filename.lower().endswith('.pdf'):
1075
+ raise HTTPException(status_code=400, detail="Only PDF files are allowed")
1076
+
1077
+ # Generate document ID if not provided
1078
+ if not document_id:
1079
+ from datetime import datetime
1080
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
1081
+ document_id = f"pdf_{timestamp}"
1082
+
1083
+ # Read PDF bytes
1084
+ pdf_bytes = await file.read()
1085
+
1086
+ # Prepare metadata
1087
+ metadata = {}
1088
+ if title:
1089
+ metadata['title'] = title
1090
+ if description:
1091
+ metadata['description'] = description
1092
+ if category:
1093
+ metadata['category'] = category
1094
+
1095
+ # Index PDF
1096
+ result = pdf_indexer.index_pdf_bytes(
1097
+ pdf_bytes=pdf_bytes,
1098
+ document_id=document_id,
1099
+ filename=file.filename,
1100
+ document_metadata=metadata
1101
+ )
1102
+
1103
+ return UploadPDFResponse(
1104
+ success=True,
1105
+ document_id=result['document_id'],
1106
+ filename=result['filename'],
1107
+ chunks_indexed=result['chunks_indexed'],
1108
+ message=f"PDF '{file.filename}' indexed successfully with {result['chunks_indexed']} chunks"
1109
+ )
1110
+
1111
+ except HTTPException:
1112
+ raise
1113
+ except Exception as e:
1114
+ raise HTTPException(status_code=500, detail=f"Error uploading PDF: {str(e)}")
1115
+
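The curl example in the docstring can be mirrored in Python. This is a sketch: the helper name is illustrative, the PDF bytes are a placeholder, and the `requests.post` call is commented out since it needs a running server.

```python
def build_pdf_upload(pdf_bytes, filename, title=None, category=None):
    """Build the multipart payload for POST /upload-pdf."""
    files = {"file": (filename, pdf_bytes, "application/pdf")}
    data = {}
    if title:
        data["title"] = title
    if category:
        data["category"] = category
    return data, files

data, files = build_pdf_upload(
    b"%PDF-1.4", "user_guide.pdf",
    title="ChatbotRAG user guide", category="user_guide",
)
# import requests
# resp = requests.post("http://localhost:8000/upload-pdf", data=data, files=files)
```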
1116
+
1117
+ @app.get("/documents/pdf")
1118
+ async def list_pdf_documents():
1119
+ """
1120
+ List all PDF documents in knowledge base
1121
+
1122
+ Returns:
1123
+ - documents: List of PDF documents with metadata
1124
+ """
1125
+ try:
1126
+ docs = list(documents_collection.find(
1127
+ {"type": "pdf"},
1128
+ {"_id": 0}
1129
+ ))
1130
+ return {"documents": docs, "total": len(docs)}
1131
+ except Exception as e:
1132
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
1133
+
1134
+
1135
+ @app.delete("/documents/pdf/{document_id}")
1136
+ async def delete_pdf_document(document_id: str):
1137
+ """
1138
+ Delete PDF document and all its chunks from knowledge base
1139
+
1140
+ Args:
1141
+ - document_id: Document ID
1142
+
1143
+ Returns:
1144
+ - success: True/False
1145
+ - message: Status message
1146
+ """
1147
+ try:
1148
+ # Get document info
1149
+ doc = documents_collection.find_one({"document_id": document_id, "type": "pdf"})
1150
+
1151
+ if not doc:
1152
+ raise HTTPException(status_code=404, detail=f"PDF document {document_id} not found")
1153
+
1154
+ # Delete all chunks from Qdrant
1155
+ chunk_ids = doc.get('chunk_ids', [])
1156
+ for chunk_id in chunk_ids:
1157
+ try:
1158
+ qdrant_service.delete_by_id(chunk_id)
1159
+ except Exception:
1160
+ pass # Chunk might already be deleted
1161
+
1162
+ # Delete from MongoDB
1163
+ documents_collection.delete_one({"document_id": document_id})
1164
+
1165
+ return {
1166
+ "success": True,
1167
+ "message": f"PDF document {document_id} and {len(chunk_ids)} chunks deleted"
1168
+ }
1169
+
1170
+ except HTTPException:
1171
+ raise
1172
+ except Exception as e:
1173
+ raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
1174
+
1175
+
1176
+ @app.post("/upload-pdf-multimodal", response_model=UploadPDFResponse)
1177
+ async def upload_pdf_multimodal(
1178
+ file: UploadFile = File(...),
1179
+ document_id: Optional[str] = Form(None),
1180
+ title: Optional[str] = Form(None),
1181
+ description: Optional[str] = Form(None),
1182
+ category: Optional[str] = Form(None)
1183
+ ):
1184
+ """
1185
+ Upload PDF with text and image URLs (for user guides with screenshots)
1186
+
1187
+ This endpoint is optimized for PDFs containing:
1188
+ - Text instructions
1189
+ - Image URLs (http://... or https://...)
1190
+ - Markdown images: ![alt](url)
1191
+ - HTML images: <img src="url">
1192
+
1193
+ The system will:
1194
+ 1. Extract text from PDF
1195
+ 2. Detect all image URLs in the text
1196
+ 3. Link images to their corresponding text chunks
1197
+ 4. Store image URLs in metadata
1198
+ 5. Return images along with text during chat
1199
+
1200
+ Body (multipart/form-data):
1201
+ - file: PDF file (required)
1202
+ - document_id: Custom document ID (optional, auto-generated if not provided)
1203
+ - title: Document title (optional)
1204
+ - description: Document description (optional)
1205
+ - category: Document category (optional, e.g., "user_guide", "tutorial")
1206
+
1207
+ Returns:
1208
+ - success: True/False
1209
+ - document_id: Document ID
1210
+ - filename: Original filename
1211
+ - chunks_indexed: Number of chunks created
1212
+ - message: Status message (includes image count)
1213
+
1214
+ Example:
1215
+ ```bash
1216
+ curl -X POST "http://localhost:8000/upload-pdf-multimodal" \
1217
+ -F "file=@user_guide_with_images.pdf" \
1218
+ -F "title=Hướng dẫn có ảnh minh họa" \
1219
+ -F "category=user_guide"
1220
+ ```
1221
+
1222
+ Example Response:
1223
+ ```json
1224
+ {
1225
+ "success": true,
1226
+ "document_id": "pdf_20251029_150000",
1227
+ "filename": "user_guide_with_images.pdf",
1228
+ "chunks_indexed": 25,
1229
+ "message": "PDF 'user_guide_with_images.pdf' indexed with 25 chunks and 15 images"
1230
+ }
1231
+ ```
1232
+ """
1233
+ try:
1234
+ # Validate file type
1235
+ if not file.filename.lower().endswith('.pdf'):
1236
+ raise HTTPException(status_code=400, detail="Only PDF files are allowed")
1237
+
1238
+ # Generate document ID if not provided
1239
+ if not document_id:
1240
+ from datetime import datetime
1241
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
1242
+ document_id = f"pdf_multimodal_{timestamp}"
1243
+
1244
+ # Read PDF bytes
1245
+ pdf_bytes = await file.read()
1246
+
1247
+ # Prepare metadata
1248
+ metadata = {'type': 'multimodal'}
1249
+ if title:
1250
+ metadata['title'] = title
1251
+ if description:
1252
+ metadata['description'] = description
1253
+ if category:
1254
+ metadata['category'] = category
1255
+
1256
+ # Index PDF with multimodal parser
1257
+ result = multimodal_pdf_indexer.index_pdf_bytes(
1258
+ pdf_bytes=pdf_bytes,
1259
+ document_id=document_id,
1260
+ filename=file.filename,
1261
+ document_metadata=metadata
1262
+ )
1263
+
1264
+ return UploadPDFResponse(
1265
+ success=True,
1266
+ document_id=result['document_id'],
1267
+ filename=result['filename'],
1268
+ chunks_indexed=result['chunks_indexed'],
1269
+ message=f"PDF '{file.filename}' indexed successfully with {result['chunks_indexed']} chunks and {result.get('images_found', 0)} images"
1270
+ )
1271
+
1272
+ except HTTPException:
1273
+ raise
1274
+ except Exception as e:
1275
+ raise HTTPException(status_code=500, detail=f"Error uploading multimodal PDF: {str(e)}")
1276
+
1277
+
1278
+ if __name__ == "__main__":
1279
+ import uvicorn
1280
+ uvicorn.run(
1281
+ app,
1282
+ host="0.0.0.0",
1283
+ port=8000,
1284
+ log_level="info"
1285
+ )
multimodal_pdf_parser.py ADDED
@@ -0,0 +1,390 @@
+ """
+ Enhanced Multimodal PDF Parser for PDFs with Text + Image URLs
+ Extracts text, detects image URLs, and links them together
+ """
+
+ import pypdfium2 as pdfium
+ from typing import List, Dict, Optional, Tuple
+ import re
+ from dataclasses import dataclass, field
+
+
+ @dataclass
+ class MultimodalChunk:
+     """Represents a chunk with text and associated images"""
+     text: str
+     page_number: int
+     chunk_index: int
+     image_urls: List[str] = field(default_factory=list)
+     metadata: Dict = field(default_factory=dict)
+
+
+ class MultimodalPDFParser:
+     """
+     Enhanced PDF Parser that extracts text and image URLs
+     Well suited to user guides with screenshots and visual instructions
+     """
+
+     def __init__(
+         self,
+         chunk_size: int = 500,
+         chunk_overlap: int = 50,
+         min_chunk_size: int = 50,
+         extract_images: bool = True
+     ):
+         self.chunk_size = chunk_size
+         self.chunk_overlap = chunk_overlap
+         self.min_chunk_size = min_chunk_size
+         self.extract_images = extract_images
+
+         # URL patterns
+         self.url_patterns = [
+             # Standard URLs
+             r'https?://[^\s<>"{}|\\^`\[\]]+',
+             # Markdown images: ![alt](url)
+             r'!\[.*?\]\((https?://[^\s)]+)\)',
+             # HTML images: <img src="url">
+             r'<img[^>]+src=["\']([^"\']+)["\']',
+             # Direct image extensions
+             r'https?://[^\s<>"{}|\\^`\[\]]+\.(?:jpg|jpeg|png|gif|bmp|svg|webp)',
+         ]
+
+     def extract_image_urls(self, text: str) -> List[str]:
+         """
+         Extract all image URLs from text
+
+         Args:
+             text: Text content
+
+         Returns:
+             List of image URLs found
+         """
+         urls = []
+
+         for pattern in self.url_patterns:
+             matches = re.findall(pattern, text, re.IGNORECASE)
+             urls.extend(matches)
+
+         # Remove duplicates while preserving order
+         seen = set()
+         unique_urls = []
+         for url in urls:
+             if url not in seen:
+                 seen.add(url)
+                 unique_urls.append(url)
+
+         return unique_urls
+
+     def extract_text_from_pdf(self, pdf_path: str) -> Dict[int, Tuple[str, List[str]]]:
+         """
+         Extract text and image URLs from PDF
+
+         Args:
+             pdf_path: Path to PDF file
+
+         Returns:
+             Dictionary mapping page number to (text, image_urls) tuple
+         """
+         pdf_pages = {}
+
+         try:
+             pdf = pdfium.PdfDocument(pdf_path)
+
+             for page_num in range(len(pdf)):
+                 page = pdf[page_num]
+                 textpage = page.get_textpage()
+                 text = textpage.get_text_range()
+
+                 # Clean text
+                 text = self._clean_text(text)
+
+                 # Extract image URLs if enabled
+                 image_urls = []
+                 if self.extract_images:
+                     image_urls = self.extract_image_urls(text)
+
+                 pdf_pages[page_num + 1] = (text, image_urls)
+
+             return pdf_pages
+
+         except Exception as e:
+             raise Exception(f"Error reading PDF: {str(e)}")
+
+     def _clean_text(self, text: str) -> str:
+         """Clean extracted text"""
+         # Collapse excessive whitespace
+         text = re.sub(r'\s+', ' ', text)
+         # Remove NUL characters
+         text = text.replace('\x00', '')
+         return text.strip()
+
+     def chunk_text_with_images(
+         self,
+         text: str,
+         image_urls: List[str],
+         page_number: int
+     ) -> List[MultimodalChunk]:
+         """
+         Split text into chunks and associate images with relevant chunks
+
+         Args:
+             text: Text to chunk
+             image_urls: Image URLs from the page
+             page_number: Page number
+
+         Returns:
+             List of MultimodalChunk objects
+         """
+         # Split into words
+         words = text.split()
+
+         if len(words) < self.min_chunk_size:
+             if len(words) > 0:
+                 return [MultimodalChunk(
+                     text=text,
+                     page_number=page_number,
+                     chunk_index=0,
+                     image_urls=image_urls,  # All images go to the single chunk
+                     metadata={'page': page_number, 'chunk': 0}
+                 )]
+             return []
+
+         chunks = []
+         chunk_index = 0
+         start = 0
+
+         # Calculate how to distribute images across chunks
+         images_per_chunk = len(image_urls) // max(1, len(words) // self.chunk_size) if image_urls else 0
+         image_index = 0
+
+         while start < len(words):
+             end = min(start + self.chunk_size, len(words))
+             chunk_words = words[start:end]
+             chunk_text = ' '.join(chunk_words)
+
+             # Assign images to this chunk
+             chunk_images = []
+             if image_urls:
+                 # Simple strategy: keep a URL with the chunk whose text contains it
+                 for url in image_urls:
+                     if url in chunk_text:
+                         chunk_images.append(url)
+
+                 # If no URLs appear in the chunk text, distribute the remainder evenly
+                 if not chunk_images and image_index < len(image_urls):
+                     num_imgs = min(images_per_chunk + 1, len(image_urls) - image_index)
+                     chunk_images = image_urls[image_index:image_index + num_imgs]
+                     image_index += num_imgs
+
+             chunks.append(MultimodalChunk(
+                 text=chunk_text,
+                 page_number=page_number,
+                 chunk_index=chunk_index,
+                 image_urls=chunk_images,
+                 metadata={
+                     'page': page_number,
+                     'chunk': chunk_index,
+                     'start_word': start,
+                     'end_word': end,
+                     'has_images': len(chunk_images) > 0,
+                     'num_images': len(chunk_images)
+                 }
+             ))
+
+             chunk_index += 1
+             start = end - self.chunk_overlap
+
+             if start >= len(words) - self.min_chunk_size:
+                 break
+
+         return chunks
+
+     def parse_pdf(
+         self,
+         pdf_path: str,
+         document_metadata: Optional[Dict] = None
+     ) -> List[MultimodalChunk]:
+         """
+         Parse PDF into multimodal chunks
+
+         Args:
+             pdf_path: Path to PDF file
+             document_metadata: Additional metadata
+
+         Returns:
+             List of MultimodalChunk objects
+         """
+         pages_data = self.extract_text_from_pdf(pdf_path)
+
+         all_chunks = []
+         for page_num, (text, image_urls) in pages_data.items():
+             chunks = self.chunk_text_with_images(text, image_urls, page_num)
+
+             # Add document metadata
+             if document_metadata:
+                 for chunk in chunks:
+                     chunk.metadata.update(document_metadata)
+
+             all_chunks.extend(chunks)
+
+         return all_chunks
+
+     def parse_pdf_bytes(
+         self,
+         pdf_bytes: bytes,
+         document_metadata: Optional[Dict] = None
+     ) -> List[MultimodalChunk]:
+         """Parse PDF from bytes"""
+         import tempfile
+         import os
+
+         with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
+             tmp.write(pdf_bytes)
+             tmp_path = tmp.name
+
+         try:
+             chunks = self.parse_pdf(tmp_path, document_metadata)
+             return chunks
+         finally:
+             if os.path.exists(tmp_path):
+                 os.unlink(tmp_path)
+
+
+ class MultimodalPDFIndexer:
+     """Index multimodal PDF chunks into the RAG system"""
+
+     def __init__(self, embedding_service, qdrant_service, documents_collection):
+         self.embedding_service = embedding_service
+         self.qdrant_service = qdrant_service
+         self.documents_collection = documents_collection
+         self.parser = MultimodalPDFParser()
+
+     def index_pdf(
+         self,
+         pdf_path: str,
+         document_id: str,
+         document_metadata: Optional[Dict] = None
+     ) -> Dict:
+         """Index PDF with image URLs"""
+         chunks = self.parser.parse_pdf(pdf_path, document_metadata)
+
+         indexed_count = 0
+         chunk_ids = []
+         total_images = 0
+
+         for chunk in chunks:
+             chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+
+             # Generate embedding (text-based)
+             embedding = self.embedding_service.encode_text(chunk.text)
+
+             # Prepare metadata with image URLs
+             metadata = {
+                 'text': chunk.text,
+                 'document_id': document_id,
+                 'page': chunk.page_number,
+                 'chunk_index': chunk.chunk_index,
+                 'source': 'pdf',
+                 'has_images': len(chunk.image_urls) > 0,
+                 'image_urls': chunk.image_urls,  # Store image URLs!
+                 'num_images': len(chunk.image_urls),
+                 **chunk.metadata
+             }
+
+             # Index to Qdrant
+             self.qdrant_service.index_data(
+                 doc_id=chunk_id,
+                 embedding=embedding,
+                 metadata=metadata
+             )
+
+             chunk_ids.append(chunk_id)
+             indexed_count += 1
+             total_images += len(chunk.image_urls)
+
+         # Save document info
+         doc_info = {
+             'document_id': document_id,
+             'type': 'multimodal_pdf',
+             'file_path': pdf_path,
+             'num_chunks': indexed_count,
+             'total_images': total_images,
+             'chunk_ids': chunk_ids,
+             'metadata': document_metadata or {}
+         }
+         self.documents_collection.insert_one(doc_info)
+
+         return {
+             'success': True,
+             'document_id': document_id,
+             'chunks_indexed': indexed_count,
+             'images_found': total_images,
+             'chunk_ids': chunk_ids[:5]
+         }
+
+     def index_pdf_bytes(
+         self,
+         pdf_bytes: bytes,
+         document_id: str,
+         filename: str,
+         document_metadata: Optional[Dict] = None
+     ) -> Dict:
+         """Index PDF from bytes"""
+         doc_metadata = document_metadata or {}
+         doc_metadata['filename'] = filename
+
+         chunks = self.parser.parse_pdf_bytes(pdf_bytes, doc_metadata)
+
+         indexed_count = 0
+         chunk_ids = []
+         total_images = 0
+
+         for chunk in chunks:
+             chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+
+             embedding = self.embedding_service.encode_text(chunk.text)
+
+             # Per-chunk payload; kept separate from doc_metadata so the
+             # document record below is not overwritten by the last chunk
+             chunk_metadata = {
+                 'text': chunk.text,
+                 'document_id': document_id,
+                 'page': chunk.page_number,
+                 'chunk_index': chunk.chunk_index,
+                 'source': 'multimodal_pdf',
+                 'filename': filename,
+                 'has_images': len(chunk.image_urls) > 0,
+                 'image_urls': chunk.image_urls,
+                 'num_images': len(chunk.image_urls),
+                 **chunk.metadata
+             }
+
+             self.qdrant_service.index_data(
+                 doc_id=chunk_id,
+                 embedding=embedding,
+                 metadata=chunk_metadata
+             )
+
+             chunk_ids.append(chunk_id)
+             indexed_count += 1
+             total_images += len(chunk.image_urls)
+
+         doc_info = {
+             'document_id': document_id,
+             'type': 'multimodal_pdf',
+             'filename': filename,
+             'num_chunks': indexed_count,
+             'total_images': total_images,
+             'chunk_ids': chunk_ids,
+             'metadata': doc_metadata
+         }
+         self.documents_collection.insert_one(doc_info)
+
+         return {
+             'success': True,
+             'document_id': document_id,
+             'filename': filename,
+             'chunks_indexed': indexed_count,
+             'images_found': total_images,
+             'chunk_ids': chunk_ids[:5]
+         }
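
As a sanity check, the URL-extraction logic above can be exercised on its own. This is a minimal sketch that copies the regex patterns from `MultimodalPDFParser.__init__` into a standalone function (it does not import the module itself, and the sample URLs are made up for illustration):

```python
import re

# Patterns mirroring MultimodalPDFParser.url_patterns
url_patterns = [
    r'https?://[^\s<>"{}|\\^`\[\]]+',                                      # standard URLs
    r'!\[.*?\]\((https?://[^\s)]+)\)',                                     # Markdown images
    r'<img[^>]+src=["\']([^"\']+)["\']',                                   # HTML <img> tags
    r'https?://[^\s<>"{}|\\^`\[\]]+\.(?:jpg|jpeg|png|gif|bmp|svg|webp)',   # image extensions
]

def extract_image_urls(text: str) -> list:
    """Collect unique URLs in order of first appearance."""
    urls = []
    for pattern in url_patterns:
        urls.extend(re.findall(pattern, text, re.IGNORECASE))
    seen, unique_urls = set(), []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique_urls.append(url)
    return unique_urls

sample = 'Docs at https://example.com/b.jpg and <img src="https://example.com/c.png">'
print(extract_image_urls(sample))
```

Note that a URL found by several patterns (e.g. one that is both a bare link and an `<img src>`) is reported only once, since the dedup pass keeps the first occurrence.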
pdf_parser.py ADDED
@@ -0,0 +1,371 @@
+ """
+ PDF Parser Service for RAG Chatbot
+ Extracts text from PDF and splits into chunks for indexing
+ """
+
+ import pypdfium2 as pdfium
+ from typing import List, Dict, Optional
+ import re
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class PDFChunk:
+     """Represents a chunk of text from PDF"""
+     text: str
+     page_number: int
+     chunk_index: int
+     metadata: Dict
+
+
+ class PDFParser:
+     """Parse PDF files and prepare for RAG indexing"""
+
+     def __init__(
+         self,
+         chunk_size: int = 500,    # words per chunk
+         chunk_overlap: int = 50,  # words of overlap between chunks
+         min_chunk_size: int = 50  # minimum words in a chunk
+     ):
+         self.chunk_size = chunk_size
+         self.chunk_overlap = chunk_overlap
+         self.min_chunk_size = min_chunk_size
+
+     def extract_text_from_pdf(self, pdf_path: str) -> Dict[int, str]:
+         """
+         Extract text from a PDF file
+
+         Args:
+             pdf_path: Path to PDF file
+
+         Returns:
+             Dictionary mapping page number to text content
+         """
+         pdf_text = {}
+
+         try:
+             pdf = pdfium.PdfDocument(pdf_path)
+
+             for page_num in range(len(pdf)):
+                 page = pdf[page_num]
+                 textpage = page.get_textpage()
+                 text = textpage.get_text_range()
+
+                 # Clean text
+                 text = self._clean_text(text)
+                 pdf_text[page_num + 1] = text  # 1-indexed pages
+
+             return pdf_text
+
+         except Exception as e:
+             raise Exception(f"Error reading PDF: {str(e)}")
+
+     def _clean_text(self, text: str) -> str:
+         """Clean extracted text"""
+         # Collapse excessive whitespace
+         text = re.sub(r'\s+', ' ', text)
+
+         # Remove special characters that might cause issues
+         text = text.replace('\x00', '')
+
+         return text.strip()
+
+     def chunk_text(self, text: str, page_number: int) -> List[PDFChunk]:
+         """
+         Split text into overlapping chunks
+
+         Args:
+             text: Text to chunk
+             page_number: Page number this text came from
+
+         Returns:
+             List of PDFChunk objects
+         """
+         # Split into words
+         words = text.split()
+
+         if len(words) < self.min_chunk_size:
+             # Text too short, return as a single chunk
+             if len(words) > 0:
+                 return [PDFChunk(
+                     text=text,
+                     page_number=page_number,
+                     chunk_index=0,
+                     metadata={'page': page_number, 'chunk': 0}
+                 )]
+             return []
+
+         chunks = []
+         chunk_index = 0
+         start = 0
+
+         while start < len(words):
+             # Get chunk
+             end = min(start + self.chunk_size, len(words))
+             chunk_words = words[start:end]
+             chunk_text = ' '.join(chunk_words)
+
+             chunks.append(PDFChunk(
+                 text=chunk_text,
+                 page_number=page_number,
+                 chunk_index=chunk_index,
+                 metadata={
+                     'page': page_number,
+                     'chunk': chunk_index,
+                     'start_word': start,
+                     'end_word': end
+                 }
+             ))
+
+             chunk_index += 1
+
+             # Move start position back by the overlap
+             start = end - self.chunk_overlap
+
+             # Avoid an infinite loop / tiny tail chunk
+             if start >= len(words) - self.min_chunk_size:
+                 break
+
+         return chunks
+
+     def parse_pdf(
+         self,
+         pdf_path: str,
+         document_metadata: Optional[Dict] = None
+     ) -> List[PDFChunk]:
+         """
+         Parse an entire PDF into chunks
+
+         Args:
+             pdf_path: Path to PDF file
+             document_metadata: Additional metadata for the document
+
+         Returns:
+             List of all chunks from the PDF
+         """
+         # Extract text from all pages
+         pages_text = self.extract_text_from_pdf(pdf_path)
+
+         # Chunk each page
+         all_chunks = []
+         for page_num, text in pages_text.items():
+             chunks = self.chunk_text(text, page_num)
+
+             # Add document metadata
+             if document_metadata:
+                 for chunk in chunks:
+                     chunk.metadata.update(document_metadata)
+
+             all_chunks.extend(chunks)
+
+         return all_chunks
+
+     def parse_pdf_bytes(
+         self,
+         pdf_bytes: bytes,
+         document_metadata: Optional[Dict] = None
+     ) -> List[PDFChunk]:
+         """
+         Parse a PDF from bytes (for uploaded files)
+
+         Args:
+             pdf_bytes: PDF file as bytes
+             document_metadata: Additional metadata
+
+         Returns:
+             List of chunks
+         """
+         import tempfile
+         import os
+
+         # Save to a temp file
+         with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
+             tmp.write(pdf_bytes)
+             tmp_path = tmp.name
+
+         try:
+             chunks = self.parse_pdf(tmp_path, document_metadata)
+             return chunks
+         finally:
+             # Clean up the temp file
+             if os.path.exists(tmp_path):
+                 os.unlink(tmp_path)
+
+     def get_pdf_info(self, pdf_path: str) -> Dict:
+         """
+         Get basic info about a PDF
+
+         Args:
+             pdf_path: Path to PDF file
+
+         Returns:
+             Dictionary with PDF information
+         """
+         try:
+             pdf = pdfium.PdfDocument(pdf_path)
+
+             info = {
+                 'num_pages': len(pdf),
+                 'file_path': pdf_path,
+             }
+
+             return info
+
+         except Exception as e:
+             raise Exception(f"Error reading PDF info: {str(e)}")
+
+
+ class PDFIndexer:
+     """Index PDF chunks into the RAG system"""
+
+     def __init__(self, embedding_service, qdrant_service, documents_collection):
+         self.embedding_service = embedding_service
+         self.qdrant_service = qdrant_service
+         self.documents_collection = documents_collection
+         self.parser = PDFParser()
+
+     def index_pdf(
+         self,
+         pdf_path: str,
+         document_id: str,
+         document_metadata: Optional[Dict] = None
+     ) -> Dict:
+         """
+         Index an entire PDF into the RAG system
+
+         Args:
+             pdf_path: Path to PDF file
+             document_id: Unique ID for this document
+             document_metadata: Additional metadata (title, author, etc.)
+
+         Returns:
+             Indexing results
+         """
+         # Parse PDF
+         chunks = self.parser.parse_pdf(pdf_path, document_metadata)
+
+         # Index each chunk
+         indexed_count = 0
+         chunk_ids = []
+
+         for chunk in chunks:
+             # Generate a unique ID for the chunk
+             chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+
+             # Generate embedding
+             embedding = self.embedding_service.encode_text(chunk.text)
+
+             # Prepare metadata
+             metadata = {
+                 'text': chunk.text,
+                 'document_id': document_id,
+                 'page': chunk.page_number,
+                 'chunk_index': chunk.chunk_index,
+                 'source': 'pdf',
+                 **chunk.metadata
+             }
+
+             # Index to Qdrant
+             self.qdrant_service.index_data(
+                 doc_id=chunk_id,
+                 embedding=embedding,
+                 metadata=metadata
+             )
+
+             chunk_ids.append(chunk_id)
+             indexed_count += 1
+
+         # Save document info to MongoDB
+         doc_info = {
+             'document_id': document_id,
+             'type': 'pdf',
+             'file_path': pdf_path,
+             'num_chunks': indexed_count,
+             'chunk_ids': chunk_ids,
+             'metadata': document_metadata or {},
+             'pdf_info': self.parser.get_pdf_info(pdf_path)
+         }
+         self.documents_collection.insert_one(doc_info)
+
+         return {
+             'success': True,
+             'document_id': document_id,
+             'chunks_indexed': indexed_count,
+             'chunk_ids': chunk_ids[:5]  # Return first 5 as a sample
+         }
+
+     def index_pdf_bytes(
+         self,
+         pdf_bytes: bytes,
+         document_id: str,
+         filename: str,
+         document_metadata: Optional[Dict] = None
+     ) -> Dict:
+         """
+         Index a PDF from bytes (for uploaded files)
+
+         Args:
+             pdf_bytes: PDF file as bytes
+             document_id: Unique ID for this document
+             filename: Original filename
+             document_metadata: Additional metadata
+
+         Returns:
+             Indexing results
+         """
+         # Parse PDF
+         doc_metadata = document_metadata or {}
+         doc_metadata['filename'] = filename
+
+         chunks = self.parser.parse_pdf_bytes(pdf_bytes, doc_metadata)
+
+         # Index each chunk
+         indexed_count = 0
+         chunk_ids = []
+
+         for chunk in chunks:
+             # Generate a unique ID for the chunk
+             chunk_id = f"{document_id}_p{chunk.page_number}_c{chunk.chunk_index}"
+
+             # Generate embedding
+             embedding = self.embedding_service.encode_text(chunk.text)
+
+             # Per-chunk payload; kept separate from doc_metadata so the
+             # document record below keeps the document-level metadata
+             chunk_metadata = {
+                 'text': chunk.text,
+                 'document_id': document_id,
+                 'page': chunk.page_number,
+                 'chunk_index': chunk.chunk_index,
+                 'source': 'pdf',
+                 'filename': filename,
+                 **chunk.metadata
+             }
+
+             # Index to Qdrant
+             self.qdrant_service.index_data(
+                 doc_id=chunk_id,
+                 embedding=embedding,
+                 metadata=chunk_metadata
+             )
+
+             chunk_ids.append(chunk_id)
+             indexed_count += 1
+
+         # Save document info to MongoDB
+         doc_info = {
+             'document_id': document_id,
+             'type': 'pdf',
+             'filename': filename,
+             'num_chunks': indexed_count,
+             'chunk_ids': chunk_ids,
+             'metadata': doc_metadata
+         }
+         self.documents_collection.insert_one(doc_info)
+
+         return {
+             'success': True,
+             'document_id': document_id,
+             'filename': filename,
+             'chunks_indexed': indexed_count,
+             'chunk_ids': chunk_ids[:5]
+         }
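
The overlapping word-window loop used by `PDFParser.chunk_text` can be illustrated in isolation. A minimal sketch with small, hypothetical sizes (the real defaults are 500/50/50):

```python
def chunk_words(words, chunk_size=10, overlap=3, min_chunk=3):
    """Yield overlapping word windows, mirroring PDFParser.chunk_text."""
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(' '.join(words[start:end]))
        start = end - overlap  # step back by the overlap
        if start >= len(words) - min_chunk:
            break              # stop before emitting a tiny tail chunk
    return chunks

words = [f"w{i}" for i in range(25)]
chunks = chunk_words(words)
print(len(chunks))
```

With 25 words, a window of 10 and an overlap of 3, the effective step is 7 words, so the windows start at 0, 7, 14, and 21, and each chunk repeats the last 3 words of its predecessor.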
qdrant_service.py ADDED
@@ -0,0 +1,447 @@
1
+ from qdrant_client import QdrantClient
2
+ from qdrant_client.models import (
3
+ Distance, VectorParams, PointStruct,
4
+ SearchRequest, SearchParams, HnswConfigDiff,
5
+ OptimizersConfigDiff, ScalarQuantization,
6
+ ScalarQuantizationConfig, ScalarType,
7
+ QuantizationSearchParams
8
+ )
9
+ from typing import List, Dict, Any, Optional
10
+ import numpy as np
11
+ import uuid
12
+ import os
13
+
14
+
15
+ class QdrantVectorService:
16
+ """
17
+ Qdrant Cloud Vector Database Service với cấu hình tối ưu
18
+ - HNSW algorithm với parameters mạnh mẽ nhất
19
+ - Scalar Quantization để tối ưu memory và speed
20
+ - Hỗ trợ hybrid search (text + image)
21
+ """
22
+
23
+ def __init__(
24
+ self,
25
+ url: Optional[str] = None,
26
+ api_key: Optional[str] = None,
27
+ collection_name: str = "event_social_media",
28
+ vector_size: int = 1024, # Jina CLIP v2 dimension
29
+ ):
30
+ """
31
+ Initialize Qdrant Cloud client
32
+
33
+ Args:
34
+ url: Qdrant Cloud URL (từ env hoặc truyền vào)
35
+ api_key: Qdrant API key (từ env hoặc truyền vào)
36
+ collection_name: Tên collection
37
+ vector_size: Dimension của vectors (1024 cho Jina CLIP v2)
38
+ """
39
+ # Lấy credentials từ env nếu không truyền vào
40
+ self.url = url or os.getenv("QDRANT_URL")
41
+ self.api_key = api_key or os.getenv("QDRANT_API_KEY")
42
+
43
+ if not self.url or not self.api_key:
44
+ raise ValueError("Cần cung cấp QDRANT_URL và QDRANT_API_KEY (qua env hoặc params)")
45
+
46
+ print(f"Connecting to Qdrant Cloud...")
47
+
48
+ # Initialize Qdrant Cloud client
49
+ self.client = QdrantClient(
50
+ url=self.url,
51
+ api_key=self.api_key,
52
+ )
53
+
54
+ self.collection_name = collection_name
55
+ self.vector_size = vector_size
56
+
57
+ # Create collection nếu chưa tồn tại
58
+ self._ensure_collection()
59
+
60
+ print(f"✓ Connected to Qdrant collection: {collection_name}")
61
+
62
+ def _ensure_collection(self):
63
+ """
64
+ Tạo collection với HNSW config tối ưu nhất
65
+ """
66
+ # Check nếu collection đã tồn tại
67
+ collections = self.client.get_collections().collections
68
+ collection_exists = any(c.name == self.collection_name for c in collections)
69
+
70
+ if not collection_exists:
71
+ print(f"Creating collection {self.collection_name} with optimal HNSW config...")
72
+
73
+ self.client.create_collection(
74
+ collection_name=self.collection_name,
75
+ vectors_config=VectorParams(
76
+ size=self.vector_size,
77
+ distance=Distance.COSINE, # Cosine similarity cho embeddings
78
+ hnsw_config=HnswConfigDiff(
79
+ m=64, # Số edges per node - cao nhất cho accuracy
80
+ ef_construct=512, # Search range khi build index - cao cho quality
81
+ full_scan_threshold=10000, # Threshold để switch sang full scan
82
+ max_indexing_threads=0, # Auto-detect số threads
83
+ on_disk=False, # Keep trong RAM cho speed (nếu đủ memory)
84
+ )
85
+ ),
86
+ optimizers_config=OptimizersConfigDiff(
87
+ deleted_threshold=0.2,
88
+ vacuum_min_vector_number=1000,
89
+ default_segment_number=2,
90
+ max_segment_size=200000,
91
+ memmap_threshold=50000,
92
+ indexing_threshold=10000,
93
+ flush_interval_sec=5,
94
+ max_optimization_threads=0, # Auto-detect
95
+ ),
96
+ # Sử dụng Scalar Quantization để tối ưu memory và speed
97
+ quantization_config=ScalarQuantization(
98
+ scalar=ScalarQuantizationConfig(
99
+ type=ScalarType.INT8,
100
+ quantile=0.99,
101
+ always_ram=True, # Keep quantized vectors trong RAM
102
+ )
103
+ )
104
+ )
105
+ print("✓ Collection created with optimal configuration")
106
+ else:
107
+ print("✓ Collection already exists")
108
+
109
+ def _convert_to_valid_id(self, doc_id: str) -> str:
110
+ """
111
+ Convert bất kỳ string ID nào thành UUID hợp lệ cho Qdrant
112
+
113
+ Args:
114
+ doc_id: Original ID (có thể là MongoDB ObjectId, string, etc.)
115
+
116
+ Returns:
117
+ UUID string hợp lệ
118
+ """
119
+ if not doc_id:
120
+ return str(uuid.uuid4())
121
+
122
+ # Nếu đã là UUID hợp lệ, giữ nguyên
123
+ try:
124
+ uuid.UUID(doc_id)
125
+ return doc_id
126
+ except ValueError:
127
+ pass
128
+
129
+ # Convert string sang UUID deterministic (cùng input = cùng UUID)
130
+ # Sử dụng UUID v5 với namespace DNS
131
+ return str(uuid.uuid5(uuid.NAMESPACE_DNS, doc_id))
132
+
133
+ def index_data(
134
+ self,
135
+ doc_id: str,
136
+ embedding: np.ndarray,
137
+ metadata: Dict[str, Any]
138
+ ) -> Dict[str, str]:
139
+ """
140
+ Index data vào Qdrant
141
+
142
+ Args:
143
+ doc_id: ID của document (MongoDB ObjectId, string, etc.)
144
+ embedding: Vector embedding từ Jina CLIP
145
+ metadata: Metadata (text, image_url, event_info, etc.)
146
+
147
+ Returns:
148
+ Dict với original_id và qdrant_id
149
+ """
150
+ # Convert ID thành UUID hợp lệ
151
+ qdrant_id = self._convert_to_valid_id(doc_id)
152
+
153
+ # Lưu original ID vào metadata
154
+ metadata['original_id'] = doc_id
155
+
156
+ # Ensure embedding là 1D array
157
+ if len(embedding.shape) > 1:
158
+ embedding = embedding.flatten()
159
+
160
+ # Create point
161
+ point = PointStruct(
162
+ id=qdrant_id,
163
+ vector=embedding.tolist(),
164
+ payload=metadata
165
+ )
166
+
167
+ # Upsert vào collection
168
+ self.client.upsert(
169
+ collection_name=self.collection_name,
170
+ points=[point]
171
+ )
172
+
173
+ return {
174
+ "original_id": doc_id,
175
+ "qdrant_id": qdrant_id
176
+ }
177
+
178
+ def batch_index(
179
+ self,
180
+ doc_ids: List[str],
181
+ embeddings: np.ndarray,
182
+ metadata_list: List[Dict[str, Any]]
183
+ ) -> List[Dict[str, str]]:
184
+ """
185
+ Batch index nhiều documents cùng lúc
186
+
187
+ Args:
188
+ doc_ids: List of document IDs (MongoDB ObjectId, string, etc.)
189
+ embeddings: Numpy array of embeddings (n_samples, embedding_dim)
190
+ metadata_list: List of metadata dicts
191
+
192
+ Returns:
193
+ List of dicts với original_id và qdrant_id
194
+ """
195
+ points = []
196
+ id_mappings = []
197
+
198
+ for i, (doc_id, embedding, metadata) in enumerate(zip(doc_ids, embeddings, metadata_list)):
199
+ # Convert to valid UUID
200
+ qdrant_id = self._convert_to_valid_id(doc_id)
201
+
202
+ # Lưu original ID vào metadata
203
+ metadata['original_id'] = doc_id
204
+
205
+ # Ensure embedding là 1D
206
+ if len(embedding.shape) > 1:
207
+ embedding = embedding.flatten()
208
+
209
+ points.append(PointStruct(
210
+ id=qdrant_id,
211
+ vector=embedding.tolist(),
212
+ payload=metadata
213
+ ))
214
+
215
+ id_mappings.append({
216
+ "original_id": doc_id,
217
+ "qdrant_id": qdrant_id
218
+ })
219
+
220
+ # Batch upsert
221
+ self.client.upsert(
222
+ collection_name=self.collection_name,
223
+ points=points,
224
+ wait=True # Wait for indexing to complete
225
+ )
226
+
227
+ return id_mappings
228
+
229
+ def search(
230
+ self,
231
+ query_embedding: np.ndarray,
232
+ limit: int = 10,
233
+ score_threshold: Optional[float] = None,
234
+ filter_conditions: Optional[Dict] = None,
235
+ ef: int = 256 # Search quality parameter - cao hơn = accurate hơn
236
+ ) -> List[Dict[str, Any]]:
237
+ """
238
+ Search similar vectors trong Qdrant
239
+
240
+ Args:
241
+ query_embedding: Query embedding từ Jina CLIP
242
+ limit: Số lượng results trả về
243
+ score_threshold: Minimum similarity score (0-1)
244
+ filter_conditions: Qdrant filter conditions
245
+ ef: HNSW search parameter (128-512, cao hơn = accurate hơn)
246
+
247
+ Returns:
248
+ List of search results với id, score, và metadata
249
+ """
250
+ # Ensure query embedding là 1D
251
+ if len(query_embedding.shape) > 1:
252
+ query_embedding = query_embedding.flatten()
253
+
254
+ # Search với HNSW parameters tối ưu
255
+ search_result = self.client.search(
256
+ collection_name=self.collection_name,
257
+ query_vector=query_embedding.tolist(),
258
+ limit=limit,
259
+ score_threshold=score_threshold,
260
+ query_filter=filter_conditions,
261
+ search_params=SearchParams(
262
+ hnsw_ef=ef, # Higher ef = more accurate search
263
+ exact=False, # Use HNSW (not exact search)
264
+ quantization=QuantizationSearchParams(
265
+ ignore=False, # Use quantization
266
+ rescore=True, # Rescore với original vectors
267
+ oversampling=2.0 # Oversample factor
268
+ )
269
+ ),
270
+ with_payload=True,
271
+ with_vectors=False # Không cần return vectors
272
+ )
273
+
274
+ # Format results - trả về original_id thay vì UUID
275
+ results = []
276
+ for hit in search_result:
277
+ # Lấy original_id từ metadata (MongoDB ObjectId)
278
+ original_id = hit.payload.get('original_id', hit.id)
279
+
280
+ results.append({
281
+ "id": original_id, # Trả về MongoDB ObjectId
282
+ "qdrant_id": hit.id, # UUID trong Qdrant
283
+ "confidence": float(hit.score), # Cosine similarity score
284
+ "metadata": hit.payload
285
+ })
286
+
287
+ return results
288
+
289
+ def hybrid_search(
290
+ self,
291
+ text_embedding: Optional[np.ndarray] = None,
292
+ image_embedding: Optional[np.ndarray] = None,
293
+ text_weight: float = 0.5,
294
+ image_weight: float = 0.5,
295
+ limit: int = 10,
296
+ score_threshold: Optional[float] = None,
297
+ ef: int = 256
298
+ ) -> List[Dict[str, Any]]:
299
+ """
300
+ Hybrid search với cả text và image embeddings
301
+
302
+ Args:
303
+ text_embedding: Text query embedding
304
+ image_embedding: Image query embedding
305
+ text_weight: Weight cho text search (0-1)
306
+ image_weight: Weight cho image search (0-1)
307
+ limit: Số results
308
+ score_threshold: Minimum score
309
+             ef: HNSW search parameter
+
+         Returns:
+             Combined search results
+         """
+         # Combine embeddings with the given weights
+         combined_embedding = np.zeros(self.vector_size)
+
+         if text_embedding is not None:
+             if len(text_embedding.shape) > 1:
+                 text_embedding = text_embedding.flatten()
+             combined_embedding += text_weight * text_embedding
+
+         if image_embedding is not None:
+             if len(image_embedding.shape) > 1:
+                 image_embedding = image_embedding.flatten()
+             combined_embedding += image_weight * image_embedding
+
+         # Normalize the combined embedding
+         norm = np.linalg.norm(combined_embedding)
+         if norm > 0:
+             combined_embedding = combined_embedding / norm
+
+         # Search with the combined embedding
+         return self.search(
+             query_embedding=combined_embedding,
+             limit=limit,
+             score_threshold=score_threshold,
+             ef=ef
+         )
+
+     def delete_by_id(self, doc_id: str) -> bool:
+         """
+         Delete document by ID (supports both MongoDB ObjectId and UUID)
+
+         Args:
+             doc_id: Document ID to delete (MongoDB ObjectId or UUID)
+
+         Returns:
+             Success status
+         """
+         # Convert to UUID if this is a MongoDB ObjectId
+         qdrant_id = self._convert_to_valid_id(doc_id)
+
+         self.client.delete(
+             collection_name=self.collection_name,
+             points_selector=[qdrant_id]
+         )
+         return True
+
+     def get_by_id(self, doc_id: str) -> Optional[Dict[str, Any]]:
+         """
+         Get document by ID (supports both MongoDB ObjectId and UUID)
+
+         Args:
+             doc_id: Document ID (MongoDB ObjectId or UUID)
+
+         Returns:
+             Document data, or None if not found
+         """
+         # Convert to UUID if this is a MongoDB ObjectId
+         qdrant_id = self._convert_to_valid_id(doc_id)
+
+         try:
+             result = self.client.retrieve(
+                 collection_name=self.collection_name,
+                 ids=[qdrant_id],
+                 with_payload=True,
+                 with_vectors=False
+             )
+
+             if result:
+                 point = result[0]
+                 original_id = point.payload.get('original_id', point.id)
+                 return {
+                     "id": original_id,      # MongoDB ObjectId
+                     "qdrant_id": point.id,  # UUID inside Qdrant
+                     "metadata": point.payload
+                 }
+             return None
+         except Exception as e:
+             print(f"Error retrieving document: {e}")
+             return None
+
+     def search_by_metadata(
+         self,
+         filter_conditions: Dict,
+         limit: int = 100
+     ) -> List[Dict[str, Any]]:
+         """
+         Search documents by metadata conditions (no embedding required)
+
+         Args:
+             filter_conditions: Qdrant filter conditions
+             limit: Maximum number of results
+
+         Returns:
+             List of matching documents
+         """
+         try:
+             result = self.client.scroll(
+                 collection_name=self.collection_name,
+                 scroll_filter=filter_conditions,
+                 limit=limit,
+                 with_payload=True,
+                 with_vectors=False
+             )
+
+             documents = []
+             for point in result[0]:  # result is a tuple (points, next_page_offset)
+                 original_id = point.payload.get('original_id', point.id)
+                 documents.append({
+                     "id": original_id,      # MongoDB ObjectId
+                     "qdrant_id": point.id,  # UUID inside Qdrant
+                     "metadata": point.payload
+                 })
+
+             return documents
+         except Exception as e:
+             print(f"Error searching by metadata: {e}")
+             return []
+
+     def get_collection_info(self) -> Dict[str, Any]:
+         """
+         Get collection information
+
+         Returns:
+             Collection info
+         """
+         info = self.client.get_collection(collection_name=self.collection_name)
+         return {
+             "vectors_count": info.vectors_count,
+             "points_count": info.points_count,
+             "status": info.status,
+             "config": {
+                 "distance": info.config.params.vectors.distance,
+                 "size": info.config.params.vectors.size,
+             }
+         }
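The combined-embedding search in `qdrant_service.py` boils down to a weighted sum of the modality embeddings followed by L2 normalization, so that the result can be compared with cosine distance. A minimal standalone sketch of that fusion step (the `fuse_embeddings` name and the fixed vector size are illustrative, not part of the service API):

```python
import numpy as np

def fuse_embeddings(text_emb=None, image_emb=None,
                    text_weight=0.5, image_weight=0.5, size=4):
    """Weighted sum of modality embeddings, then L2-normalize.

    Mirrors the fusion step in the Qdrant service; helper name and
    `size` parameter are illustrative only.
    """
    combined = np.zeros(size)
    if text_emb is not None:
        combined += text_weight * np.asarray(text_emb, dtype=float).flatten()
    if image_emb is not None:
        combined += image_weight * np.asarray(image_emb, dtype=float).flatten()
    norm = np.linalg.norm(combined)
    # Guard against the all-zero case (no embeddings supplied)
    return combined / norm if norm > 0 else combined

# Two orthogonal unit vectors fused with equal weights still yield a unit vector
fused = fuse_embeddings(text_emb=[1.0, 0.0, 0.0, 0.0],
                        image_emb=[0.0, 1.0, 0.0, 0.0])
```

The normalization matters: without it, documents retrieved with both modalities would score systematically differently from single-modality queries.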
requirements.txt ADDED
@@ -0,0 +1,34 @@
+ # FastAPI and web framework
+ fastapi==0.115.5
+ uvicorn[standard]==0.32.1
+ python-multipart==0.0.20
+
+ # Gradio for Hugging Face Spaces
+ gradio>=4.0.0
+
+ # Machine Learning & Embeddings
+ torch>=2.0.0
+ transformers>=4.50.0
+ onnxruntime==1.20.1
+ torchvision>=0.15.0
+ pillow>=10.0.0
+ numpy>=1.24.0
+
+ # Vector Database
+ qdrant-client>=1.12.1
+ grpcio>=1.60.0
+
+ # Utilities
+ pydantic>=2.0.0
+ python-dotenv==1.0.0
+
+ # MongoDB
+ pymongo>=4.6.0
+ huggingface-hub>=0.20.0
+ timm
+ einops
+
+ # PDF Processing
+ pypdfium2>=4.30.0
test_advanced_features.py ADDED
@@ -0,0 +1,260 @@
+ """
+ Test script for Advanced RAG features
+ Demonstrates new capabilities: multiple texts/images indexing and advanced RAG chat
+ """
+
+ import requests
+ from typing import List, Optional
+
+
+ class AdvancedRAGTester:
+     """Test client for the Advanced RAG API"""
+
+     def __init__(self, base_url: str = "http://localhost:8000"):
+         self.base_url = base_url
+
+     def test_multiple_index(self, doc_id: str, texts: List[str], image_paths: Optional[List[str]] = None):
+         """
+         Test indexing with multiple texts and images
+
+         Args:
+             doc_id: Document ID
+             texts: List of texts (max 10)
+             image_paths: List of image file paths (max 10)
+         """
+         print(f"\n{'='*60}")
+         print(f"TEST: Indexing document '{doc_id}' with multiple texts/images")
+         print(f"{'='*60}")
+
+         # Prepare form data
+         data = {'id': doc_id}
+
+         # Add texts
+         if texts:
+             if len(texts) > 10:
+                 print("WARNING: Maximum 10 texts allowed. Taking first 10.")
+                 texts = texts[:10]
+             data['texts'] = texts
+             print(f"✓ Texts: {len(texts)} items")
+
+         # Prepare files
+         files = []
+         if image_paths:
+             if len(image_paths) > 10:
+                 print("WARNING: Maximum 10 images allowed. Taking first 10.")
+                 image_paths = image_paths[:10]
+
+             for img_path in image_paths:
+                 try:
+                     files.append(('images', open(img_path, 'rb')))
+                 except FileNotFoundError:
+                     print(f"WARNING: Image not found: {img_path}")
+
+             print(f"✓ Images: {len(files)} files")
+
+         # Make request
+         try:
+             response = requests.post(f"{self.base_url}/index", data=data, files=files)
+             response.raise_for_status()
+
+             result = response.json()
+             print(f"\n✓ SUCCESS")
+             print(f"  - Document ID: {result['id']}")
+             print(f"  - Message: {result['message']}")
+             return result
+
+         except requests.exceptions.RequestException as e:
+             print(f"\n✗ ERROR: {e}")
+             if e.response is not None:
+                 print(f"  Response: {e.response.text}")
+             return None
+
+         finally:
+             # Close file handles
+             for _, file_obj in files:
+                 file_obj.close()
+
+     def test_advanced_rag_chat(
+         self,
+         message: str,
+         hf_token: Optional[str] = None,
+         use_advanced_rag: bool = True,
+         use_reranking: bool = True,
+         use_compression: bool = True,
+         top_k: int = 3,
+         score_threshold: float = 0.5
+     ):
+         """
+         Test advanced RAG chat
+
+         Args:
+             message: User question
+             hf_token: Hugging Face token (optional)
+             use_advanced_rag: Use advanced RAG pipeline
+             use_reranking: Enable reranking
+             use_compression: Enable context compression
+             top_k: Number of documents to retrieve
+             score_threshold: Minimum relevance score
+         """
+         print(f"\n{'='*60}")
+         print(f"TEST: Advanced RAG Chat")
+         print(f"{'='*60}")
+         print(f"Question: {message}")
+         print(f"Advanced RAG: {use_advanced_rag}")
+         print(f"Reranking: {use_reranking}")
+         print(f"Compression: {use_compression}")
+
+         payload = {
+             'message': message,
+             'use_rag': True,
+             'use_advanced_rag': use_advanced_rag,
+             'use_reranking': use_reranking,
+             'use_compression': use_compression,
+             'top_k': top_k,
+             'score_threshold': score_threshold,
+         }
+
+         if hf_token:
+             payload['hf_token'] = hf_token
+
+         try:
+             response = requests.post(f"{self.base_url}/chat", json=payload)
+             response.raise_for_status()
+
+             result = response.json()
+
+             print(f"\n✓ SUCCESS")
+             print(f"\n--- Answer ---")
+             print(result['response'])
+
+             print(f"\n--- Retrieved Context ({len(result['context_used'])} documents) ---")
+             for i, ctx in enumerate(result['context_used'], 1):
+                 print(f"{i}. [{ctx['id']}] Confidence: {ctx['confidence']:.2%}")
+                 text_preview = ctx['metadata'].get('text', '')[:100]
+                 print(f"   Text: {text_preview}...")
+
+             if result.get('rag_stats'):
+                 print(f"\n--- RAG Pipeline Statistics ---")
+                 stats = result['rag_stats']
+                 print(f"  Original query: {stats.get('original_query')}")
+                 print(f"  Expanded queries: {stats.get('expanded_queries')}")
+                 print(f"  Initial results: {stats.get('initial_results')}")
+                 print(f"  After reranking: {stats.get('after_rerank')}")
+                 print(f"  After compression: {stats.get('after_compression')}")
+
+             return result
+
+         except requests.exceptions.RequestException as e:
+             print(f"\n✗ ERROR: {e}")
+             if e.response is not None:
+                 print(f"  Response: {e.response.text}")
+             return None
+
+     def compare_basic_vs_advanced_rag(self, message: str, hf_token: Optional[str] = None):
+         """Compare basic RAG vs advanced RAG side by side"""
+         print(f"\n{'='*60}")
+         print(f"COMPARISON: Basic RAG vs Advanced RAG")
+         print(f"{'='*60}")
+         print(f"Question: {message}\n")
+
+         # Test basic RAG
+         print("\n--- BASIC RAG ---")
+         basic_result = self.test_advanced_rag_chat(
+             message=message,
+             hf_token=hf_token,
+             use_advanced_rag=False
+         )
+
+         # Test advanced RAG
+         print("\n--- ADVANCED RAG ---")
+         advanced_result = self.test_advanced_rag_chat(
+             message=message,
+             hf_token=hf_token,
+             use_advanced_rag=True
+         )
+
+         # Compare
+         print(f"\n{'='*60}")
+         print("COMPARISON SUMMARY")
+         print(f"{'='*60}")
+
+         if basic_result and advanced_result:
+             print(f"Basic RAG:")
+             print(f"  - Retrieved docs: {len(basic_result['context_used'])}")
+
+             print(f"\nAdvanced RAG:")
+             print(f"  - Retrieved docs: {len(advanced_result['context_used'])}")
+             if advanced_result.get('rag_stats'):
+                 stats = advanced_result['rag_stats']
+                 print(f"  - Query expansion: {len(stats.get('expanded_queries', []))} variants")
+                 print(f"  - Initial retrieval: {stats.get('initial_results', 0)} docs")
+                 print(f"  - After reranking: {stats.get('after_rerank', 0)} docs")
+
+
+ def main():
+     """Run tests"""
+     tester = AdvancedRAGTester()
+
+     print("="*60)
+     print("ADVANCED RAG FEATURE TESTS")
+     print("="*60)
+
+     # Test 1: Index with multiple texts (no images for demo)
+     print("\n\n### TEST 1: Index Multiple Texts ###")
+     tester.test_multiple_index(
+         doc_id="event_music_festival_2025",
+         texts=[
+             "Festival âm nhạc quốc tế Hà Nội 2025",
+             "Thời gian: 15-17 tháng 11 năm 2025",
+             "Địa điểm: Công viên Thống Nhất, Hà Nội",
+             "Line-up: Sơn Tùng MTP, Đen Vâu, Hoàng Thùy Linh, Mỹ Tâm",
+             "Giá vé: Early bird 500.000đ, VIP 2.000.000đ",
+             "Dự kiến 50.000 khán giả tham dự",
+             "3 sân khấu chính, 5 food court, khu vực cắm trại"
+         ]
+     )
+
+     # Test 2: Index another document
+     print("\n\n### TEST 2: Index Another Document ###")
+     tester.test_multiple_index(
+         doc_id="safety_guidelines",
+         texts=[
+             "Vũ khí và đồ vật nguy hiểm bị cấm mang vào sự kiện",
+             "Dao, kiếm, súng và các loại vũ khí nguy hiểm nghiêm cấm",
+             "An ninh sẽ kiểm tra tất cả túi xách và đồ mang theo",
+             "Vi phạm sẽ bị tịch thu và có thể bị trục xuất khỏi sự kiện"
+         ]
+     )
+
+     # Test 3: Basic chat (without HF token - will show placeholder)
+     print("\n\n### TEST 3: Basic RAG Chat (No LLM) ###")
+     tester.test_advanced_rag_chat(
+         message="Festival Hà Nội diễn ra khi nào?",
+         use_advanced_rag=False
+     )
+
+     # Test 4: Advanced RAG chat
+     print("\n\n### TEST 4: Advanced RAG Chat (No LLM) ###")
+     tester.test_advanced_rag_chat(
+         message="Festival Hà Nội diễn ra khi nào và có những nghệ sĩ nào?",
+         use_advanced_rag=True,
+         use_reranking=True,
+         use_compression=True
+     )
+
+     # Test 5: Compare basic vs advanced
+     print("\n\n### TEST 5: Comparison Test ###")
+     tester.compare_basic_vs_advanced_rag(
+         message="Dao có được mang vào sự kiện không?"
+     )
+
+     print("\n\n" + "="*60)
+     print("ALL TESTS COMPLETED")
+     print("="*60)
+     print("\nNOTE: To test with actual LLM responses, add your Hugging Face token:")
+     print("  tester.test_advanced_rag_chat(message='...', hf_token='hf_xxxxx')")
+
+
+ if __name__ == "__main__":
+     main()
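The test client above passes a Python list as the `texts` form value; `requests` urlencodes a list as repeated form fields, which is the wire format a FastAPI parameter declared as `texts: List[str] = Form(...)` expects. This can be checked offline by preparing the request without sending it (the URL is a placeholder):

```python
import requests

# Prepare (but do not send) the same kind of request the tester builds
req = requests.Request(
    'POST', 'http://localhost:8000/index',  # placeholder, nothing is sent
    data={'id': 'doc1', 'texts': ['Text 1', 'Text 2']},
).prepare()

# The urlencoded body repeats the `texts` key once per list element
print(req.body)  # id=doc1&texts=Text+1&texts=Text+2
```

The same applies to the `('images', file)` tuples: repeating the field name is how multiple files land in a `List[UploadFile]` parameter on the server side.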
verify_dependencies.py ADDED
@@ -0,0 +1,102 @@
+ """
+ Verify all dependencies are installed correctly
+ Run: python verify_dependencies.py
+ """
+
+ import sys
+
+
+ def check_dependency(module_name, package_name=None):
+     """Check if a dependency is installed"""
+     if package_name is None:
+         package_name = module_name
+
+     try:
+         __import__(module_name)
+         print(f"✓ {package_name}")
+         return True
+     except ImportError as e:
+         print(f"✗ {package_name} - NOT INSTALLED")
+         print(f"  Error: {e}")
+         return False
+
+
+ def main():
+     print("="*60)
+     print("Dependency Verification")
+     print("="*60)
+
+     dependencies = [
+         # Web framework
+         ("fastapi", "fastapi"),
+         ("uvicorn", "uvicorn"),
+         ("multipart", "python-multipart"),
+
+         # ML & Embeddings
+         ("torch", "torch"),
+         ("transformers", "transformers"),
+         ("PIL", "pillow"),
+         ("numpy", "numpy"),
+
+         # Vector DB
+         ("qdrant_client", "qdrant-client"),
+
+         # Utilities
+         ("pydantic", "pydantic"),
+         ("dotenv", "python-dotenv"),
+
+         # MongoDB
+         ("pymongo", "pymongo"),
+         ("huggingface_hub", "huggingface-hub"),
+         ("timm", "timm"),
+         ("einops", "einops"),
+
+         # PDF Processing (NEW)
+         ("pypdfium2", "pypdfium2"),
+     ]
+
+     print("\nChecking dependencies...\n")
+
+     all_ok = True
+     for module, package in dependencies:
+         if not check_dependency(module, package):
+             all_ok = False
+
+     print("\n" + "="*60)
+     if all_ok:
+         print("✓ All dependencies installed successfully!")
+         print("\nYou can now run:")
+         print("  python main.py")
+     else:
+         print("✗ Some dependencies are missing!")
+         print("\nPlease install missing dependencies:")
+         print("  pip install -r requirements.txt")
+         sys.exit(1)
+
+     print("="*60)
+
+     # Check our custom application modules
+     print("\nChecking system modules...\n")
+
+     custom_modules = [
+         "embedding_service",
+         "qdrant_service",
+         "advanced_rag",
+         "pdf_parser",
+         "multimodal_pdf_parser",
+     ]
+
+     for module in custom_modules:
+         try:
+             __import__(module)
+             print(f"✓ {module}.py")
+         except ImportError as e:
+             print(f"✗ {module}.py - ERROR: {e}")
+
+     print("\n" + "="*60)
+     print("Verification complete!")
+     print("="*60)
+
+
+ if __name__ == "__main__":
+     main()
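The verifier above imports each package with `__import__`, which executes the package's import-time code; for heavy packages like torch that can take seconds. A lighter-weight alternative is to probe for the module spec only (the `is_installed` helper is illustrative, not part of the script):

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    # find_spec locates the module on sys.path without executing it;
    # it returns None for a missing top-level module.
    return importlib.util.find_spec(module_name) is not None
```

The trade-off: `find_spec` confirms the module can be found, but unlike a real import it will not surface errors that only occur when the module actually loads (e.g. a broken native extension), so the `__import__`-based check remains the stricter test.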