---
title: MarkItDownTestingPlatform
emoji: 📊
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
short_description: Enterprise-Grade Document Conversion Testing with AI-Powered
---

# 🚀 MarkItDown Testing Platform

**Enterprise-Grade Document Conversion Testing with AI-Powered Analysis**

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## 🎯 Overview

A comprehensive testing platform for Microsoft's MarkItDown document conversion tool, enhanced with Google Gemini AI analysis capabilities. Designed for enterprise-scale document processing workflows, with a focus on quality assessment and performance optimization.

### ✨ Key Features

- **🔄 Multi-Format Support**: PDF, DOCX, PPTX, XLSX, HTML, TXT, CSV, JSON, XML
- **🤖 AI-Powered Analysis**: Google Gemini integration for quality assessment
- **📊 Interactive Dashboards**: Real-time visualization of conversion metrics
- **🏢 Enterprise-Ready**: Scalable architecture with comprehensive error handling
- **💾 Export Capabilities**: Multiple output formats for integration workflows
- **📈 Performance Monitoring**: Detailed analytics and optimization insights

## 🚀 Quick Start

### Using the Hugging Face Space

1. **Visit the Space**: [MarkItDown Testing Platform](https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform)
2. **Upload Document**: Drag & drop or select your document
3. **Configure Analysis**: Enter Gemini API key for AI analysis (optional)
4. **Process**: Click "Process Document" and review results
5. **Export**: Download results in your preferred format

### Getting Gemini API Key

1. Visit [Google AI Studio](https://makersuite.google.com/app/apikey)
2. Create a new API key
3. Copy and paste into the application
4. Enjoy AI-powered document analysis!

## 📋 Supported File Formats

| Category | Formats | Notes |
|----------|---------|-------|
| **Documents** | PDF, DOCX, PPTX, XLSX | Full structure preservation |
| **Web Content** | HTML, HTM | Complete formatting retention |
| **Text Files** | TXT, CSV, JSON, XML | Enhanced parsing capabilities |
| **Rich Text** | RTF | Advanced formatting support |
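
A pre-upload check against the table above could be sketched as follows. The extension sets are copied from the table; the helper name `format_category` is hypothetical and is not part of the platform's code.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the table above, grouped by category.
SUPPORTED_FORMATS = {
    "Documents": {".pdf", ".docx", ".pptx", ".xlsx"},
    "Web Content": {".html", ".htm"},
    "Text Files": {".txt", ".csv", ".json", ".xml"},
    "Rich Text": {".rtf"},
}


def format_category(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return category
    return None


print(format_category("report.PDF"))   # Documents
print(format_category("archive.zip"))  # None
```

Matching on the extension alone is a cheap first filter; it does not confirm that the file content matches its extension.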

## 🏗️ Architecture Overview

```
┌─────────────────────────────────────────┐
│           Gradio Interface              │
├─────────────────────────────────────────┤
│ File Upload │ Config │ Analysis │ Export│
├─────────────────────────────────────────┤
│         Processing Pipeline             │
├─────────────────────────────────────────┤
│ MarkItDown │ Gemini AI │ Visualization  │
├─────────────────────────────────────────┤
│        Analytics & Reporting            │
└─────────────────────────────────────────┘
```

### Core Components

- **`core/modules.py`**: Stateless processing engine optimized for HF Spaces
- **`llm/gemini_connector.py`**: Enterprise Gemini API integration
- **`visualization/analytics_engine.py`**: Interactive dashboard generation
- **`app.py`**: Main Gradio application orchestration

## 🔧 Technical Specifications

### System Requirements
- **Python**: 3.10+
- **Memory**: Optimized for HF Spaces (16GB limit)
- **Storage**: Stateless design with temporary file handling
- **Processing**: Async pipeline with resource management

### Key Dependencies
```text
gradio>=4.44.0                  # Gradio interface (HF Spaces compatible)
markitdown[all]>=0.1.0          # Microsoft conversion engine
google-genai>=1.0.0             # Gemini integration (new client)
plotly>=5.17.0                  # Interactive visualizations
pandas>=1.5.0                   # Data processing
```

## 📊 Analysis Capabilities

### Quality Metrics
- **Structure Score**: Heading, list, table preservation (0-10)
- **Completeness Score**: Information retention assessment (0-10)
- **Accuracy Score**: Formatting correctness evaluation (0-10)
- **Readability Score**: AI-friendly output optimization (0-10)
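
One way to carry the four scores around and sanity-check their 0-10 range is sketched below. The `QualityScores` class and the unweighted mean in `overall()` are assumptions made for illustration; the README does not say how, or whether, the platform combines the metrics.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class QualityScores:
    """Hypothetical container for the four metrics listed above."""
    structure: float
    completeness: float
    accuracy: float
    readability: float

    def __post_init__(self) -> None:
        # Every metric is documented as a 0-10 score, so reject anything else.
        for name, value in vars(self).items():
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be within 0-10, got {value}")

    def overall(self) -> float:
        """Unweighted mean of the four metrics (an assumed aggregation)."""
        return mean(list(vars(self).values()))


scores = QualityScores(structure=8, completeness=9, accuracy=7, readability=6)
print(scores.overall())  # 7.5
```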

### AI Analysis Types
- **Quality Analysis**: Comprehensive conversion assessment
- **Structure Review**: Document hierarchy and organization
- **Content Summary**: Thematic analysis and key insights
- **Extraction Quality**: Data preservation evaluation

### Visualization Features
- **Quality Dashboard**: Multi-metric radar and performance charts
- **Structure Analysis**: Hierarchical document mapping
- **Comparison Tools**: Multi-document analysis capabilities
- **Performance Timeline**: Processing optimization insights

## 🎯 Use Cases

### Enterprise Document Migration
- **Legacy System Modernization**: Convert historical documents to modern formats
- **Content Management**: Standardize document formats across organizations
- **Compliance Documentation**: Ensure consistent formatting for regulatory requirements

### AI/ML Pipeline Integration
- **RAG System Preparation**: Optimize documents for retrieval systems
- **Training Data Processing**: Convert diverse formats for model training
- **Content Analysis**: Extract structured data from unstructured documents

### Quality Assurance
- **Conversion Validation**: Verify accuracy of automated processing
- **Performance Benchmarking**: Compare different conversion approaches
- **Error Detection**: Identify and resolve processing issues

## 📈 Performance Optimization

### HF Spaces Optimizations
- **Memory Management**: Automatic cleanup and resource monitoring
- **Processing Limits**: Smart file size and timeout management
- **Async Processing**: Non-blocking operations for better UX
- **Error Recovery**: Graceful degradation and retry mechanisms
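
The combination of timeouts, async processing, and retries listed above can be illustrated with a small standard-library sketch. The function name `run_with_retry`, the backoff schedule, and the choice of which exceptions to retry are assumptions; the platform's own retry logic may differ.

```python
import asyncio


async def run_with_retry(make_call, *, timeout_s=300.0, attempts=3, base_delay_s=0.5):
    """Await make_call() with a per-attempt timeout, retrying transient failures."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, OSError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff between attempts: base, 2x base, 4x base, ...
                await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


async def demo():
    calls = {"n": 0}

    async def flaky_conversion():
        # Stand-in for a real conversion call: fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient failure")
        return "converted"

    return await run_with_retry(flaky_conversion, base_delay_s=0.01)


print(asyncio.run(demo()))  # converted
```

The default `timeout_s` of 300 mirrors the 5-minute `PROCESSING_TIMEOUT` value shown under Environment Variables.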

### Best Practices
- **File Preparation**: Use high-quality source documents
- **API Management**: Monitor Gemini API usage and limits
- **Result Analysis**: Review quality metrics for optimization opportunities
- **Export Strategy**: Choose appropriate formats for downstream processing

## 🛠️ Development Setup

### Local Development
```bash
# Clone repository
git clone https://github.com/your-username/markitdown-testing-platform
cd markitdown-testing-platform

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

# Run application
python app.py
```

### Environment Variables
```bash
# Optional: Set custom configurations
export GRADIO_TEMP_DIR="/tmp"
export MAX_FILE_SIZE="52428800"  # 50MB in bytes
export PROCESSING_TIMEOUT="300"  # 5 minutes
```
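
On the Python side, settings like these are typically read with a fallback default. The sketch below assumes the variable names shown above; the helper `read_int_env` is hypothetical, and `app.py` may parse its configuration differently.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Read a positive integer from the environment, falling back to default."""
    raw = os.environ.get(name, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default  # ignore malformed values instead of crashing at startup
    return value if value > 0 else default


MAX_FILE_SIZE = read_int_env("MAX_FILE_SIZE", 50 * 1024 * 1024)  # bytes
PROCESSING_TIMEOUT = read_int_env("PROCESSING_TIMEOUT", 300)     # seconds
```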

### Deploying to Hugging Face Spaces

1. **Create the Space**
   - Open [huggingface.co/spaces/new](https://huggingface.co/spaces/new)
   - Choose the **Gradio** SDK, the name `DocSA/MarkItDownTestingPlatform`, and the **Python 3.11** runtime
   - `app_file` must remain `app.py`

2. **Push the code**
   ```bash
   git remote add hf https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform
   git push hf main
   ```

3. **Configure secrets and environment variables**
   - Add the `GEMINI_API_KEY` secret (Settings → Repository secrets → Add)
   - Additional (non-secret) variables: `MAX_FILE_SIZE_MB=50`, `PROCESSING_TIMEOUT=300`, `APP_VERSION=2.0.0-enterprise`

4. **Runtime notes**
   - Gemini analysis is disabled by default; the user enables it manually
   - Default settings: analysis type **Content Summary**, model **Gemini 2.0 Flash**
   - Gemini quota limits are handled by automatic fallback models
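
The quota fallback mentioned in the last item can be pictured as trying a list of models in order. The following is only a sketch of that pattern: `QuotaExceeded`, `analyze_with_fallback`, the `call_model` callable, and the model identifiers are all placeholders, not the connector's real API.

```python
class QuotaExceeded(Exception):
    """Hypothetical error raised when a model's quota is exhausted."""


def analyze_with_fallback(prompt, call_model, models):
    """Try each model in order, moving on only when its quota is exhausted."""
    for model in models:
        try:
            return model, call_model(model, prompt)
        except QuotaExceeded:
            continue  # this model is out of quota, try the next one
    raise QuotaExceeded(f"all models exhausted: {', '.join(models)}")


def fake_call(model, prompt):
    # Stand-in for a real API call: the first model is always over quota.
    if model == "gemini-2.0-flash":
        raise QuotaExceeded(model)
    return f"summary from {model}"


used, result = analyze_with_fallback(
    "Summarize this document.",
    fake_call,
    ["gemini-2.0-flash", "backup-model"],  # placeholder identifiers
)
print(used, "->", result)  # backup-model -> summary from backup-model
```

Only the quota error triggers a fallback here; any other exception propagates so that real failures are not hidden.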

## 📚 API Reference

### Core Processing Pipeline
```python
from core.modules import StreamlineFileHandler, HFConversionEngine
from llm.gemini_connector import GeminiAnalysisEngine

# Initialize components
handler = StreamlineFileHandler(resource_manager)
engine = HFConversionEngine(resource_manager, config)
gemini = GeminiAnalysisEngine(gemini_config)

# Process document (these awaits must run inside an async function)
file_result = await handler.process_upload(file_obj)
conversion_result = await engine.convert_stream(file_content, metadata)
analysis_result = await gemini.analyze_content(analysis_request)
```

### Visualization Generation
```python
from visualization.analytics_engine import InteractiveVisualizationEngine

viz_engine = InteractiveVisualizationEngine()
dashboard = viz_engine.create_quality_dashboard(conversion_result, analysis_result)
structure_viz = viz_engine.create_structural_analysis_viz(conversion_result)
```

## 🔐 Security & Privacy

### Data Handling
- **No Persistent Storage**: All processing in memory with automatic cleanup
- **API Key Security**: Keys stored locally, never transmitted to servers
- **File Privacy**: Temporary files automatically deleted after processing
- **Error Logging**: Sanitized logs without sensitive information
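
Automatic cleanup of temporary files is commonly done with a context manager. The sketch below uses only the standard library; the function name `process_in_temp_dir` and the `convert` callable are hypothetical and stand in for whatever the platform's file handler actually does.

```python
import tempfile
from pathlib import Path


def process_in_temp_dir(filename: str, data: bytes, convert) -> str:
    """Write an upload to a temporary directory, convert it, then clean up."""
    # The directory and its contents are removed when the with-block exits,
    # including when convert() raises.
    with tempfile.TemporaryDirectory(prefix="upload_") as tmp:
        # Path(filename).name drops directory components from the upload name.
        path = Path(tmp) / Path(filename).name
        path.write_bytes(data)
        return convert(path)


seen_paths = []


def fake_convert(path: Path) -> str:
    # Stand-in for a real conversion step.
    seen_paths.append(path)
    return path.read_text(encoding="utf-8").upper()


print(process_in_temp_dir("../note.txt", b"hello", fake_convert))  # HELLO
print(seen_paths[0].exists())  # False
```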

### Compliance Features
- **GDPR Ready**: No personal data retention
- **Enterprise Security**: Secure API integrations
- **Audit Trail**: Comprehensive processing logs
- **Access Control**: Environment-based configuration

## 🀝 Contributing

### Development Guidelines
1. **Code Style**: Follow PEP 8 with Black formatting
2. **Testing**: Comprehensive unit and integration tests
3. **Documentation**: Detailed docstrings and README updates
4. **Performance**: Memory-efficient and HF Spaces optimized

### Pull Request Process
1. Fork the repository
2. Create feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **Microsoft MarkItDown**: Core document conversion capabilities
- **Google Gemini**: Advanced AI analysis features
- **Hugging Face**: Platform hosting and community support
- **Plotly**: Interactive visualization framework
- **Gradio**: User interface framework

## 📞 Support

### Getting Help
- **Documentation**: Comprehensive guides and examples
- **Issues**: [GitHub Issues](https://github.com/your-username/markitdown-testing-platform/issues)
- **Discussions**: [Community Forum](https://github.com/your-username/markitdown-testing-platform/discussions)
- **Email**: [email protected]

### Frequently Asked Questions

**Q: What's the maximum file size?**
A: 50MB for HF Spaces free tier. Larger files can be processed in local deployments.

**Q: Do I need a Gemini API key?**
A: No, basic conversion works without an API key. A Gemini key enables the AI analysis features.

**Q: Can I process multiple files at once?**
A: The current version supports single-file processing. Batch processing is available in advanced analytics.

**Q: How accurate are the quality scores?**
A: Scores are based on structural analysis and AI evaluation. Use as guidelines for optimization.

---

**Built with ❤️ for enterprise document processing**

*Last updated: September 2025*