--- title: MarkItDownTestingPlatform emoji: πŸ“Š colorFrom: pink colorTo: gray sdk: gradio sdk_version: 4.44.0 app_file: app.py pinned: false short_description: Enterprise-Grade Document Conversion Testing with AI-Powered --- # πŸš€ MarkItDown Testing Platform **Enterprise-Grade Document Conversion Testing with AI-Powered Analysis** [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) ## 🎯 Overview A comprehensive testing platform for Microsoft's MarkItDown document conversion tool, enhanced with Google Gemini AI analysis capabilities. Designed for enterprise-scale document processing workflows with focus on quality assessment and performance optimization. ### ✨ Key Features - **πŸ”„ Multi-Format Support**: PDF, DOCX, PPTX, XLSX, HTML, TXT, CSV, JSON, XML - **πŸ€– AI-Powered Analysis**: Google Gemini integration for quality assessment - **πŸ“Š Interactive Dashboards**: Real-time visualization of conversion metrics - **🏒 Enterprise-Ready**: Scalable architecture with comprehensive error handling - **πŸ’Ύ Export Capabilities**: Multiple output formats for integration workflows - **πŸ“ˆ Performance Monitoring**: Detailed analytics and optimization insights ## πŸš€ Quick Start ### Using the Hugging Face Space 1. **Visit the Space**: [MarkItDown Testing Platform](https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform) 2. **Upload Document**: Drag & drop or select your document 3. **Configure Analysis**: Enter Gemini API key for AI analysis (optional) 4. **Process**: Click "Process Document" and review results 5. **Export**: Download results in your preferred format ### Getting Gemini API Key 1. Visit [Google AI Studio](https://makersuite.google.com/app/apikey) 2. Create a new API key 3. Copy and paste into the application 4. Enjoy AI-powered document analysis! ## πŸ“‹ Supported File Formats | Category | Formats | Notes | |----------|---------|-------| | **Documents** | PDF, DOCX, PPTX, XLSX | Full structure preservation | | **Web Content** | HTML, HTM | Complete formatting retention | | **Text Files** | TXT, CSV, JSON, XML | Enhanced parsing capabilities | | **Rich Text** | RTF | Advanced formatting support | ## πŸ—οΈ Architecture Overview ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Gradio Interface β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ File Upload β”‚ Config β”‚ Analysis β”‚ Exportβ”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Processing Pipeline β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚MarkItDown β”‚ Gemini AI β”‚ Visualization β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Analytics & Reporting β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` ### Core Components - **`core/modules.py`**: Stateless processing engine optimized for HF Spaces - **`llm/gemini_connector.py`**: Enterprise Gemini API integration - **`visualization/analytics_engine.py`**: Interactive dashboard generation - **`app.py`**: Main Gradio application orchestration ## πŸ”§ Technical Specifications ### System Requirements - **Python**: 3.10+ - **Memory**: Optimized for HF Spaces (16GB limit) - **Storage**: Stateless design with temporary file handling - **Processing**: Async pipeline with resource management ### Key Dependencies ```python gradio>=4.44.0 # Gradio interface (HF Spaces compatible) markitdown[all]>=0.1.0 # Microsoft conversion engine google-genai>=1.0.0 # Gemini integration (new client) plotly>=5.17.0 # Interactive visualizations pandas>=1.5.0 # Data processing ``` ## πŸ“Š Analysis Capabilities ### Quality Metrics - **Structure Score**: Heading, list, table preservation (0-10) - **Completeness Score**: Information retention assessment (0-10) - **Accuracy Score**: Formatting correctness evaluation (0-10) - **Readability Score**: AI-friendly output optimization (0-10) ### AI Analysis Types - **Quality Analysis**: Comprehensive conversion assessment - **Structure Review**: Document hierarchy and organization - **Content Summary**: Thematic analysis and key insights - **Extraction Quality**: Data preservation evaluation ### Visualization Features - **Quality Dashboard**: Multi-metric radar and performance charts - **Structure Analysis**: Hierarchical document mapping - **Comparison Tools**: Multi-document analysis capabilities - **Performance Timeline**: Processing optimization insights ## 🎯 Use Cases ### Enterprise Document Migration - **Legacy System Modernization**: Convert historical documents to modern formats - **Content Management**: Standardize document formats across organizations - **Compliance Documentation**: Ensure consistent formatting for regulatory requirements ### AI/ML Pipeline Integration - **RAG System Preparation**: Optimize documents for retrieval systems - **Training Data Processing**: Convert diverse formats for model training - **Content Analysis**: Extract structured data from unstructured documents ### Quality Assurance - **Conversion Validation**: Verify accuracy of automated processing - **Performance Benchmarking**: Compare different conversion approaches - **Error Detection**: Identify and resolve processing issues ## πŸ“ˆ Performance Optimization ### HF Spaces Optimizations - **Memory Management**: Automatic cleanup and resource monitoring - **Processing Limits**: Smart file size and timeout management - **Async Processing**: Non-blocking operations for better UX - **Error Recovery**: Graceful degradation and retry mechanisms ### Best Practices - **File Preparation**: Use high-quality source documents - **API Management**: Monitor Gemini API usage and limits - **Result Analysis**: Review quality metrics for optimization opportunities - **Export Strategy**: Choose appropriate formats for downstream processing ## πŸ› οΈ Development Setup ### Local Development ```bash # Clone repository git clone https://github.com/your-username/markitdown-testing-platform cd markitdown-testing-platform # Create virtual environment python -m venv venv source venv/bin/activate # Linux/Mac # venv\Scripts\activate # Windows # Install dependencies pip install -r requirements.txt # Run application python app.py ``` ### Environment Variables ```bash # Optional: Set custom configurations export GRADIO_TEMP_DIR="/tmp" export MAX_FILE_SIZE="52428800" # 50MB in bytes export PROCESSING_TIMEOUT="300" # 5 minutes ``` ### Deploying to Hugging Face Spaces 1. **Π‘Ρ‚Π²ΠΎΡ€Ρ–Ρ‚ΡŒ Space** - Π’Ρ–Π΄ΠΊΡ€ΠΈΠΉΡ‚Π΅ [huggingface.co/spaces/new](https://huggingface.co/spaces/new) - ΠžΠ±Π΅Ρ€Ρ–Ρ‚ΡŒ SDK **Gradio**, Π½Π°Π·Π²Ρƒ `DocSA/MarkItDownTestingPlatform`, runtime **Python 3.11** - `app_file` ΠΌΠ°Ρ” Π·Π°Π»ΠΈΡˆΠ°Ρ‚ΠΈΡΡ `app.py` 2. **Π—Π°ΠΏΡƒΡˆΡ‚Π΅ ΠΊΠΎΠ΄** ```bash git remote add hf https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform git push hf main ``` 3. **ΠΠ°Π»Π°ΡˆΡ‚ΡƒΠΉΡ‚Π΅ сСкрСти Ρ‚Π° Π·ΠΌΡ–Π½Π½Ρ– сСрСдовища** - Π”ΠΎΠ΄Π°ΠΉΡ‚Π΅ сСкрСт `GEMINI_API_KEY` (Settings β†’ Repository secrets β†’ Add) - Π”ΠΎΠ΄Π°Ρ‚ΠΊΠΎΠ²Ρ– Π·ΠΌΡ–Π½Π½Ρ– (Π½Π΅ сСкрСтні): `MAX_FILE_SIZE_MB=50`, `PROCESSING_TIMEOUT=300`, `APP_VERSION=2.0.0-enterprise` 4. **ΠžΡΠΎΠ±Π»ΠΈΠ²ΠΎΡΡ‚Ρ– Ρ€Π°Π½Ρ‚Π°ΠΉΠΌΡƒ** - Gemini-Π°Π½Π°Π»Ρ–Π· Π²ΠΈΠΌΠΊΠ½Π΅Π½ΠΈΠΉ Π·Π° замовчуванням; користувач Π°ΠΊΡ‚ΠΈΠ²ΡƒΡ” ΠΉΠΎΠ³ΠΎ Π²Ρ€ΡƒΡ‡Π½Ρƒ - Π‘Ρ‚Π°Π½Π΄Π°Ρ€Ρ‚Π½Ρ– Π½Π°Π»Π°ΡˆΡ‚ΡƒΠ²Π°Π½Π½Ρ: Ρ‚ΠΈΠΏ Π°Π½Π°Π»Ρ–Π·Ρƒ **Content Summary**, модСль **Gemini 2.0 Flash** - ОбмСТСння ΠΊΠ²ΠΎΡ‚ Gemini ΠΎΠ±Ρ€ΠΎΠ±Π»ΡΡŽΡ‚ΡŒΡΡ Π°Π²Ρ‚ΠΎΠΌΠ°Ρ‚ΠΈΡ‡Π½ΠΈΠΌΠΈ fallback-модСлями ## πŸ“š API Reference ### Core Processing Pipeline ```python from core.modules import StreamlineFileHandler, HFConversionEngine from llm.gemini_connector import GeminiAnalysisEngine # Initialize components handler = StreamlineFileHandler(resource_manager) engine = HFConversionEngine(resource_manager, config) gemini = GeminiAnalysisEngine(gemini_config) # Process document file_result = await handler.process_upload(file_obj) conversion_result = await engine.convert_stream(file_content, metadata) analysis_result = await gemini.analyze_content(analysis_request) ``` ### Visualization Generation ```python from visualization.analytics_engine import InteractiveVisualizationEngine viz_engine = InteractiveVisualizationEngine() dashboard = viz_engine.create_quality_dashboard(conversion_result, analysis_result) structure_viz = viz_engine.create_structural_analysis_viz(conversion_result) ``` ## πŸ” Security & Privacy ### Data Handling - **No Persistent Storage**: All processing in memory with automatic cleanup - **API Key Security**: Keys stored locally, never transmitted to servers - **File Privacy**: Temporary files automatically deleted after processing - **Error Logging**: Sanitized logs without sensitive information ### Compliance Features - **GDPR Ready**: No personal data retention - **Enterprise Security**: Secure API integrations - **Audit Trail**: Comprehensive processing logs - **Access Control**: Environment-based configuration ## 🀝 Contributing ### Development Guidelines 1. **Code Style**: Follow PEP 8 with Black formatting 2. **Testing**: Comprehensive unit and integration tests 3. **Documentation**: Detailed docstrings and README updates 4. **Performance**: Memory-efficient and HF Spaces optimized ### Pull Request Process 1. Fork the repository 2. Create feature branch (`git checkout -b feature/amazing-feature`) 3. Commit changes (`git commit -m 'Add amazing feature'`) 4. Push to branch (`git push origin feature/amazing-feature`) 5. Open Pull Request ## πŸ“„ License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. ## πŸ™ Acknowledgments - **Microsoft MarkItDown**: Core document conversion capabilities - **Google Gemini**: Advanced AI analysis features - **Hugging Face**: Platform hosting and community support - **Plotly**: Interactive visualization framework - **Gradio**: User interface framework ## πŸ“ž Support ### Getting Help - **Documentation**: Comprehensive guides and examples - **Issues**: [GitHub Issues](https://github.com/your-username/markitdown-testing-platform/issues) - **Discussions**: [Community Forum](https://github.com/your-username/markitdown-testing-platform/discussions) - **Email**: support@your-domain.com ### Frequently Asked Questions **Q: What's the maximum file size?** A: 50MB for HF Spaces free tier. Larger files can be processed in local deployments. **Q: Do I need a Gemini API key?** A: No, basic conversion works without API key. Gemini key enables AI analysis features. **Q: Can I process multiple files at once?** A: Current version supports single-file processing. Batch processing available in advanced analytics. **Q: How accurate are the quality scores?** A: Scores are based on structural analysis and AI evaluation. Use as guidelines for optimization. --- **Built with ❀️ for enterprise document processing** *Last updated: September 2025*