---
title: MarkItDownTestingPlatform
emoji: π
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
short_description: Enterprise-Grade Document Conversion Testing with AI-Powered
---
# MarkItDown Testing Platform
**Enterprise-Grade Document Conversion Testing with AI-Powered Analysis**
[Hugging Face Space](https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform) | [Python](https://www.python.org/downloads/) | [MIT License](https://opensource.org/licenses/MIT)
## Overview
A comprehensive testing platform for Microsoft's MarkItDown document conversion tool, enhanced with Google Gemini AI analysis capabilities. It is designed for enterprise-scale document processing workflows, with a focus on quality assessment and performance optimization.
### Key Features
- **Multi-Format Support**: PDF, DOCX, PPTX, XLSX, HTML, TXT, CSV, JSON, XML
- **AI-Powered Analysis**: Google Gemini integration for quality assessment
- **Interactive Dashboards**: Real-time visualization of conversion metrics
- **Enterprise-Ready**: Scalable architecture with comprehensive error handling
- **Export Capabilities**: Multiple output formats for integration workflows
- **Performance Monitoring**: Detailed analytics and optimization insights
## Quick Start
### Using the Hugging Face Space
1. **Visit the Space**: [MarkItDown Testing Platform](https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform)
2. **Upload Document**: Drag & drop or select your document
3. **Configure Analysis**: Enter a Gemini API key to enable AI analysis (optional)
4. **Process**: Click "Process Document" and review results
5. **Export**: Download results in your preferred format
### Getting a Gemini API Key
1. Visit [Google AI Studio](https://makersuite.google.com/app/apikey)
2. Create a new API key
3. Copy the key and paste it into the application
4. Enjoy AI-powered document analysis!
## Supported File Formats
| Category | Formats | Notes |
|----------|---------|-------|
| **Documents** | PDF, DOCX, PPTX, XLSX | Full structure preservation |
| **Web Content** | HTML, HTM | Complete formatting retention |
| **Text Files** | TXT, CSV, JSON, XML | Enhanced parsing capabilities |
| **Rich Text** | RTF | Advanced formatting support |
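
As a minimal sketch of how an uploaded file name could be checked against the table above, the snippet below maps extensions to the listed categories. The names `SUPPORTED_FORMATS` and `classify_format` are illustrative assumptions and are not taken from the platform's code.

```python
from pathlib import Path
from typing import Optional

# Extension-to-category map transcribed from the table above.
# SUPPORTED_FORMATS and classify_format are hypothetical names.
SUPPORTED_FORMATS = {
    "Documents": {".pdf", ".docx", ".pptx", ".xlsx"},
    "Web Content": {".html", ".htm"},
    "Text Files": {".txt", ".csv", ".json", ".xml"},
    "Rich Text": {".rtf"},
}


def classify_format(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ("report.PDF", "index.htm", "notes.rtf", "archive.zip"):
        print(f"{name}: {classify_format(name) or 'unsupported'}")
```

Rejecting unsupported extensions before conversion keeps the error message close to the upload step rather than deep inside the pipeline.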
## Architecture Overview
```
+-----------------------------------------+
|            Gradio Interface             |
+-----------------------------------------+
| File Upload | Config | Analysis | Export|
+-----------------------------------------+
|           Processing Pipeline           |
+-----------------------------------------+
|  MarkItDown | Gemini AI | Visualization |
+-----------------------------------------+
|          Analytics & Reporting          |
+-----------------------------------------+
```
### Core Components
- **`core/modules.py`**: Stateless processing engine optimized for HF Spaces
- **`llm/gemini_connector.py`**: Enterprise Gemini API integration
- **`visualization/analytics_engine.py`**: Interactive dashboard generation
- **`app.py`**: Main Gradio application orchestration
## Technical Specifications
### System Requirements
- **Python**: 3.10+
- **Memory**: Optimized for HF Spaces (16GB limit)
- **Storage**: Stateless design with temporary file handling
- **Processing**: Async pipeline with resource management
### Key Dependencies
```text
gradio>=4.44.0 # Gradio interface (HF Spaces compatible)
markitdown[all]>=0.1.0 # Microsoft conversion engine
google-genai>=1.0.0 # Gemini integration (new client)
plotly>=5.17.0 # Interactive visualizations
pandas>=1.5.0 # Data processing
```
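
To verify a local environment against these minimums, a small standard-library check along the following lines may help. It is a sketch only: `parse_requirement`, `version_tuple`, `meets_minimum`, and `check_installed` are hypothetical helpers, and the comparison looks at the leading numeric release segment rather than implementing full PEP 440 ordering.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
REQUIREMENTS = [
    "gradio>=4.44.0",
    "markitdown[all]>=0.1.0",
    "google-genai>=1.0.0",
    "plotly>=5.17.0",
    "pandas>=1.5.0",
]


def parse_requirement(line: str) -> tuple[str, tuple[int, ...]]:
    """Split 'name[extras]>=X.Y.Z' into the package name and a version tuple."""
    match = re.fullmatch(
        r"([A-Za-z0-9_.-]+)(?:\[[^\]]*\])?>=(\d+(?:\.\d+)*)", line.strip()
    )
    if match is None:
        raise ValueError(f"unsupported requirement format: {line!r}")
    name, version = match.groups()
    return name, tuple(int(part) for part in version.split("."))


def version_tuple(version: str) -> tuple[int, ...]:
    """Keep only the leading numeric release segment of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(p) for p in match.group().split(".")) if match else ()


def meets_minimum(installed: tuple[int, ...], minimum: tuple[int, ...]) -> bool:
    """Compare release tuples after zero-padding them to the same length."""
    width = max(len(installed), len(minimum))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(installed) >= pad(minimum)


def check_installed(requirements=REQUIREMENTS) -> dict[str, str]:
    """Report, per package, whether the installed version meets the minimum."""
    report = {}
    for line in requirements:
        name, minimum = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = meets_minimum(version_tuple(installed), minimum)
        report[name] = f"{installed} ({'ok' if ok else 'too old'})"
    return report


if __name__ == "__main__":
    for package, status in check_installed().items():
        print(f"{package}: {status}")
```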
## Analysis Capabilities
### Quality Metrics
- **Structure Score**: Heading, list, table preservation (0-10)
- **Completeness Score**: Information retention assessment (0-10)
- **Accuracy Score**: Formatting correctness evaluation (0-10)
- **Readability Score**: AI-friendly output optimization (0-10)
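
One way to carry the four metrics around is a small validated container. The sketch below is an assumption for illustration: the `QualityScores` class and its unweighted `composite()` mean are hypothetical and do not describe how the platform actually aggregates scores.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityScores:
    """The four 0-10 metrics listed above; the class itself is hypothetical."""

    structure: float
    completeness: float
    accuracy: float
    readability: float

    def __post_init__(self) -> None:
        # Reject values outside the documented 0-10 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 10:
                raise ValueError(f"{f.name} must be within 0-10, got {value}")

    def composite(self) -> float:
        """Unweighted mean of the four metrics (an assumed aggregation)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


if __name__ == "__main__":
    scores = QualityScores(structure=8, completeness=9, accuracy=7, readability=6)
    print(f"composite: {scores.composite():.1f}")
```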
### AI Analysis Types
- **Quality Analysis**: Comprehensive conversion assessment
- **Structure Review**: Document hierarchy and organization
- **Content Summary**: Thematic analysis and key insights
- **Extraction Quality**: Data preservation evaluation
### Visualization Features
- **Quality Dashboard**: Multi-metric radar and performance charts
- **Structure Analysis**: Hierarchical document mapping
- **Comparison Tools**: Multi-document analysis capabilities
- **Performance Timeline**: Processing optimization insights
## Use Cases
### Enterprise Document Migration
- **Legacy System Modernization**: Convert historical documents to modern formats
- **Content Management**: Standardize document formats across organizations
- **Compliance Documentation**: Ensure consistent formatting for regulatory requirements
### AI/ML Pipeline Integration
- **RAG System Preparation**: Optimize documents for retrieval systems
- **Training Data Processing**: Convert diverse formats for model training
- **Content Analysis**: Extract structured data from unstructured documents
### Quality Assurance
- **Conversion Validation**: Verify accuracy of automated processing
- **Performance Benchmarking**: Compare different conversion approaches
- **Error Detection**: Identify and resolve processing issues
## Performance Optimization
### HF Spaces Optimizations
- **Memory Management**: Automatic cleanup and resource monitoring
- **Processing Limits**: Smart file size and timeout management
- **Async Processing**: Non-blocking operations for better UX
- **Error Recovery**: Graceful degradation and retry mechanisms
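
The timeout and retry behaviour described above can be sketched with `asyncio.wait_for` and exponential backoff. This is a generic illustration under assumed defaults; `run_with_retry` is a hypothetical helper, not a function from `core/modules.py`.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def run_with_retry(
    operation: Callable[[], Awaitable[T]],
    timeout_s: float = 300.0,
    attempts: int = 3,
    backoff_s: float = 1.0,
) -> T:
    """Await operation() under a timeout, retrying with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            # A fresh awaitable is created for every attempt.
            return await asyncio.wait_for(operation(), timeout=timeout_s)
        except Exception as exc:  # includes the timeout raised by wait_for
            last_error = exc
            if attempt < attempts - 1:
                await asyncio.sleep(backoff_s * 2 ** attempt)
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error


if __name__ == "__main__":
    async def convert() -> str:
        await asyncio.sleep(0.01)  # stand-in for a real conversion call
        return "converted"

    print(asyncio.run(run_with_retry(convert, timeout_s=1.0)))
```

Passing a zero-argument callable rather than a coroutine object matters here, because a coroutine can only be awaited once.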
### Best Practices
- **File Preparation**: Use high-quality source documents
- **API Management**: Monitor Gemini API usage and limits
- **Result Analysis**: Review quality metrics for optimization opportunities
- **Export Strategy**: Choose appropriate formats for downstream processing
## Development Setup
### Local Development
```bash
# Clone repository
git clone https://github.com/your-username/markitdown-testing-platform
cd markitdown-testing-platform
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Run application
python app.py
```
### Environment Variables
```bash
# Optional: Set custom configurations
export GRADIO_TEMP_DIR="/tmp"
export MAX_FILE_SIZE="52428800" # 50MB in bytes
export PROCESSING_TIMEOUT="300" # 5 minutes
```
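
On the Python side, these variables could be read along the following lines. This is a sketch under the assumption that the defaults mirror the values shown above; `AppSettings` and `load_settings` are hypothetical names rather than the application's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AppSettings:
    temp_dir: str
    max_file_size: int       # bytes
    processing_timeout: int  # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> AppSettings:
    """Read the optional variables, falling back to the documented values."""
    return AppSettings(
        temp_dir=env.get("GRADIO_TEMP_DIR", "/tmp"),
        max_file_size=int(env.get("MAX_FILE_SIZE", str(50 * 1024 * 1024))),
        processing_timeout=int(env.get("PROCESSING_TIMEOUT", "300")),
    )


if __name__ == "__main__":
    settings = load_settings()
    print(settings)
```

Accepting the environment as a mapping keeps the function easy to test without touching the real process environment.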
### Deploying to Hugging Face Spaces
1. **Create the Space**
   - Open [huggingface.co/spaces/new](https://huggingface.co/spaces/new)
   - Choose the **Gradio** SDK, the name `DocSA/MarkItDownTestingPlatform`, and the **Python 3.11** runtime
   - `app_file` must remain `app.py`
2. **Push the code**
   ```bash
   git remote add hf https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform
   git push hf main
   ```
3. **Configure secrets and environment variables**
   - Add the `GEMINI_API_KEY` secret (Settings → Repository secrets → Add)
   - Additional (non-secret) variables: `MAX_FILE_SIZE_MB=50`, `PROCESSING_TIMEOUT=300`, `APP_VERSION=2.0.0-enterprise`
4. **Runtime notes**
   - Gemini analysis is disabled by default; the user enables it manually
   - Default settings: analysis type **Content Summary**, model **Gemini 2.0 Flash**
   - Gemini quota limits are handled by automatic fallback models
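
The quota fallback mentioned in the last runtime note could look roughly like the sketch below. Everything here is assumed for illustration: `QuotaExceededError`, `analyze_with_fallback`, the caller-supplied `call_model` function, and the placeholder fallback model name are hypothetical, and the real connector may detect quota errors differently.

```python
from typing import Callable, Sequence


class QuotaExceededError(Exception):
    """Hypothetical error raised by the caller-supplied function on quota limits."""


def analyze_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in order and return (model_used, response)."""
    for model in models:
        try:
            return model, call_model(model, prompt)
        except QuotaExceededError:
            continue  # quota hit: move on to the next model
    raise QuotaExceededError("all models exhausted their quota")


if __name__ == "__main__":
    def fake_call(model: str, prompt: str) -> str:
        # Stand-in for a real API call: the first model is out of quota.
        if model == "gemini-2.0-flash":
            raise QuotaExceededError(model)
        return f"summary from {model}"

    # "fallback-model-a" is a placeholder, not a real model identifier.
    print(analyze_with_fallback("Summarize", ["gemini-2.0-flash", "fallback-model-a"], fake_call))
```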
## API Reference
### Core Processing Pipeline
```python
from core.modules import StreamlineFileHandler, HFConversionEngine
from llm.gemini_connector import GeminiAnalysisEngine

# Initialize components.
# resource_manager, config and gemini_config are assumed to be created by the caller.
handler = StreamlineFileHandler(resource_manager)
engine = HFConversionEngine(resource_manager, config)
gemini = GeminiAnalysisEngine(gemini_config)

# Process document.
# The calls below are coroutines, so they must run inside an `async def` function
# (or another context that allows top-level `await`).
file_result = await handler.process_upload(file_obj)
conversion_result = await engine.convert_stream(file_content, metadata)
analysis_result = await gemini.analyze_content(analysis_request)
```
### Visualization Generation
```python
from visualization.analytics_engine import InteractiveVisualizationEngine
viz_engine = InteractiveVisualizationEngine()
dashboard = viz_engine.create_quality_dashboard(conversion_result, analysis_result)
structure_viz = viz_engine.create_structural_analysis_viz(conversion_result)
```
## Security & Privacy
### Data Handling
- **No Persistent Storage**: All processing in memory with automatic cleanup
- **API Key Security**: Keys stored locally, never transmitted to servers
- **File Privacy**: Temporary files automatically deleted after processing
- **Error Logging**: Sanitized logs without sensitive information
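
The temporary-file handling described above can be illustrated with `tempfile.TemporaryDirectory`, which removes the directory and its contents when the block exits. This is a sketch, not the platform's implementation; `process_in_temp_dir` and the `convert` callback are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def process_in_temp_dir(data: bytes, filename: str, convert: Callable[[Path], str]) -> str:
    """Write the upload to a temporary directory, convert it, then delete everything."""
    # The directory and its contents are removed when the with-block exits,
    # including when convert() raises.
    with tempfile.TemporaryDirectory(prefix="markitdown_") as tmp:
        path = Path(tmp) / Path(filename).name  # drop any directory components
        path.write_bytes(data)
        return convert(path)


if __name__ == "__main__":
    text = process_in_temp_dir(b"# Title\n", "upload.md", lambda p: p.read_text())
    print(text)
```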
### Compliance Features
- **GDPR Ready**: No personal data retention
- **Enterprise Security**: Secure API integrations
- **Audit Trail**: Comprehensive processing logs
- **Access Control**: Environment-based configuration
## Contributing
### Development Guidelines
1. **Code Style**: Follow PEP 8 with Black formatting
2. **Testing**: Comprehensive unit and integration tests
3. **Documentation**: Detailed docstrings and README updates
4. **Performance**: Memory-efficient and HF Spaces optimized
### Pull Request Process
1. Fork the repository
2. Create feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- **Microsoft MarkItDown**: Core document conversion capabilities
- **Google Gemini**: Advanced AI analysis features
- **Hugging Face**: Platform hosting and community support
- **Plotly**: Interactive visualization framework
- **Gradio**: User interface framework
## Support
### Getting Help
- **Documentation**: Comprehensive guides and examples
- **Issues**: [GitHub Issues](https://github.com/your-username/markitdown-testing-platform/issues)
- **Discussions**: [Community Forum](https://github.com/your-username/markitdown-testing-platform/discussions)
- **Email**: [email protected]
### Frequently Asked Questions
**Q: What's the maximum file size?**
A: 50 MB on the HF Spaces free tier. Larger files can be processed in local deployments.
**Q: Do I need a Gemini API key?**
A: No. Basic conversion works without an API key; a Gemini key enables the AI analysis features.
**Q: Can I process multiple files at once?**
A: The current version supports single-file processing. Batch processing is available in advanced analytics.
**Q: How accurate are the quality scores?**
A: Scores are based on structural analysis and AI evaluation. Use them as guidelines for optimization.
---
**Built with ❤️ for enterprise document processing**
*Last updated: September 2025*