title: MarkItDownTestingPlatform
emoji: 📊
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
short_description: Enterprise-Grade Document Conversion Testing with AI-Powered
🚀 MarkItDown Testing Platform
Enterprise-Grade Document Conversion Testing with AI-Powered Analysis
🎯 Overview
A comprehensive testing platform for Microsoft's MarkItDown document conversion tool, enhanced with Google Gemini AI analysis capabilities. Designed for enterprise-scale document processing workflows, with a focus on quality assessment and performance optimization.
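For orientation, the conversion step that the platform tests can be sketched with the markitdown package on its own. This is a minimal sketch rather than the platform's code: the helper name convert_to_markdown is made up for illustration, and it assumes markitdown is installed.

```python
from pathlib import Path


def convert_to_markdown(path: str) -> str:
    """Convert one document to Markdown text using MarkItDown.

    Hypothetical helper for illustration; not part of this repository.
    """
    source = Path(path)
    if not source.is_file():
        raise FileNotFoundError(f"No such document: {source}")

    # Imported lazily so this module still loads where markitdown is absent.
    from markitdown import MarkItDown

    converter = MarkItDown()
    result = converter.convert(str(source))
    # The converted Markdown is exposed as plain text on the result object.
    return result.text_content
```

Calling convert_to_markdown("report.docx") would return the Markdown text that the platform then scores and visualizes; no Gemini key is involved at this stage.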
✨ Key Features
- 🔄 Multi-Format Support: PDF, DOCX, PPTX, XLSX, HTML, TXT, CSV, JSON, XML
- 🤖 AI-Powered Analysis: Google Gemini integration for quality assessment
- 📊 Interactive Dashboards: Real-time visualization of conversion metrics
- 🏢 Enterprise-Ready: Scalable architecture with comprehensive error handling
- 💾 Export Capabilities: Multiple output formats for integration workflows
- 📈 Performance Monitoring: Detailed analytics and optimization insights
🚀 Quick Start
Using the Hugging Face Space
- Visit the Space: MarkItDown Testing Platform
- Upload Document: Drag & drop or select your document
- Configure Analysis: Enter Gemini API key for AI analysis (optional)
- Process: Click "Process Document" and review results
- Export: Download results in your preferred format
Getting Gemini API Key
- Visit Google AI Studio
- Create a new API key
- Copy and paste into the application
- Enjoy AI-powered document analysis!
📋 Supported File Formats
| Category | Formats | Notes |
|---|---|---|
| Documents | PDF, DOCX, PPTX, XLSX | Full structure preservation |
| Web Content | HTML, HTM | Complete formatting retention |
| Text Files | TXT, CSV, JSON, XML | Enhanced parsing capabilities |
| Rich Text | RTF | Advanced formatting support |
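The table above can be expressed as a small lookup for validating uploads before conversion. This is a sketch only: the names SUPPORTED_FORMATS and categorize are hypothetical, and the platform's actual validation may work differently.

```python
from pathlib import Path
from typing import Optional

# Extension-to-category map transcribed from the table above.
SUPPORTED_FORMATS = {
    "Documents": {".pdf", ".docx", ".pptx", ".xlsx"},
    "Web Content": {".html", ".htm"},
    "Text Files": {".txt", ".csv", ".json", ".xml"},
    "Rich Text": {".rtf"},
}


def categorize(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in SUPPORTED_FORMATS.items():
        if suffix in extensions:
            return category
    return None
```

A None result is a cheap way to reject a file before any conversion work starts.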
🏗️ Architecture Overview
┌─────────────────────────────────────────┐
│ Gradio Interface │
├─────────────────────────────────────────┤
│ File Upload │ Config │ Analysis │ Export│
├─────────────────────────────────────────┤
│ Processing Pipeline │
├─────────────────────────────────────────┤
│MarkItDown │ Gemini AI │ Visualization │
├─────────────────────────────────────────┤
│ Analytics & Reporting │
└─────────────────────────────────────────┘
Core Components
- core/modules.py: Stateless processing engine optimized for HF Spaces
- llm/gemini_connector.py: Enterprise Gemini API integration
- visualization/analytics_engine.py: Interactive dashboard generation
- app.py: Main Gradio application orchestration

🔧 Technical Specifications
System Requirements
- Python: 3.10+
- Memory: Optimized for HF Spaces (16GB limit)
- Storage: Stateless design with temporary file handling
- Processing: Async pipeline with resource management
Key Dependencies
gradio>=4.44.0 # Gradio interface (HF Spaces compatible)
markitdown[all]>=0.1.0 # Microsoft conversion engine
google-genai>=1.0.0 # Gemini integration (new client)
plotly>=5.17.0 # Interactive visualizations
pandas>=1.5.0 # Data processing
📊 Analysis Capabilities
Quality Metrics
- Structure Score: Heading, list, table preservation (0-10)
- Completeness Score: Information retention assessment (0-10)
- Accuracy Score: Formatting correctness evaluation (0-10)
- Readability Score: AI-friendly output optimization (0-10)
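One way to carry the four metrics around is a small validated record. The sketch below is an assumption about data shape, not the platform's implementation; in particular, the unweighted mean used for the composite is illustrative, since the README does not say how (or whether) the scores are combined.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityScores:
    """Hypothetical container for the four 0-10 quality metrics."""

    structure: float
    completeness: float
    accuracy: float
    readability: float

    def __post_init__(self) -> None:
        # Reject anything outside the documented 0-10 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 10:
                raise ValueError(f"{f.name} must be within 0-10, got {value}")

    def composite(self) -> float:
        """Unweighted mean of the four scores (an illustrative choice)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)
```

For example, QualityScores(8, 6, 7, 9).composite() evaluates to 7.5.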
AI Analysis Types
- Quality Analysis: Comprehensive conversion assessment
- Structure Review: Document hierarchy and organization
- Content Summary: Thematic analysis and key insights
- Extraction Quality: Data preservation evaluation
Visualization Features
- Quality Dashboard: Multi-metric radar and performance charts
- Structure Analysis: Hierarchical document mapping
- Comparison Tools: Multi-document analysis capabilities
- Performance Timeline: Processing optimization insights
🎯 Use Cases
Enterprise Document Migration
- Legacy System Modernization: Convert historical documents to modern formats
- Content Management: Standardize document formats across organizations
- Compliance Documentation: Ensure consistent formatting for regulatory requirements
AI/ML Pipeline Integration
- RAG System Preparation: Optimize documents for retrieval systems
- Training Data Processing: Convert diverse formats for model training
- Content Analysis: Extract structured data from unstructured documents
Quality Assurance
- Conversion Validation: Verify accuracy of automated processing
- Performance Benchmarking: Compare different conversion approaches
- Error Detection: Identify and resolve processing issues
📈 Performance Optimization
HF Spaces Optimizations
- Memory Management: Automatic cleanup and resource monitoring
- Processing Limits: Smart file size and timeout management
- Async Processing: Non-blocking operations for better UX
- Error Recovery: Graceful degradation and retry mechanisms
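The timeout and retry behaviour described above can be sketched with asyncio alone. The function name, defaults, and the choice to retry on any exception are assumptions made for illustration; the platform's own recovery logic is not shown in this README.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    timeout: float = 300.0,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run an async operation with a per-attempt timeout and exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            # wait_for cancels the operation if it exceeds the timeout.
            return await asyncio.wait_for(operation(), timeout=timeout)
        except Exception:
            # Simplification: every failure is treated as retryable.
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing a zero-argument callable (rather than a coroutine object) matters here, because each retry needs a fresh coroutine.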
Best Practices
- File Preparation: Use high-quality source documents
- API Management: Monitor Gemini API usage and limits
- Result Analysis: Review quality metrics for optimization opportunities
- Export Strategy: Choose appropriate formats for downstream processing
🛠️ Development Setup
Local Development
# Clone repository
git clone https://github.com/your-username/markitdown-testing-platform
cd markitdown-testing-platform
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Run application
python app.py
Environment Variables
# Optional: Set custom configurations
export GRADIO_TEMP_DIR="/tmp"
export MAX_FILE_SIZE="52428800" # 50MB in bytes
export PROCESSING_TIMEOUT="300" # 5 minutes
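On the Python side, these variables could be read roughly as follows. This is a sketch: RuntimeSettings, load_settings, and check_upload_size are hypothetical names, and the fallback values simply mirror the example exports above rather than confirmed defaults in app.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RuntimeSettings:
    temp_dir: str
    max_file_size: int       # bytes
    processing_timeout: int  # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> RuntimeSettings:
    """Read the three variables, falling back to the example values above."""
    return RuntimeSettings(
        temp_dir=env.get("GRADIO_TEMP_DIR", "/tmp"),
        max_file_size=int(env.get("MAX_FILE_SIZE", "52428800")),
        processing_timeout=int(env.get("PROCESSING_TIMEOUT", "300")),
    )


def check_upload_size(size_bytes: int, settings: RuntimeSettings) -> None:
    """Raise ValueError if an upload exceeds the configured limit."""
    if size_bytes > settings.max_file_size:
        raise ValueError(
            f"File is {size_bytes} bytes; limit is {settings.max_file_size} bytes"
        )
```

Accepting the environment as a mapping keeps the loader easy to exercise in tests without touching os.environ.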
Deploying to Hugging Face Spaces
Create the Space
- Open huggingface.co/spaces/new
- Choose the Gradio SDK, the name DocSA/MarkItDownTestingPlatform, and the Python 3.11 runtime
- app_file must remain app.py
Push the code
git remote add hf https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform
git push hf main
Configure secrets and environment variables
- Add the secret GEMINI_API_KEY (Settings → Repository secrets → Add)
- Additional (non-secret) variables: MAX_FILE_SIZE_MB=50, PROCESSING_TIMEOUT=300, APP_VERSION=2.0.0-enterprise
Runtime notes
- Gemini analysis is disabled by default; the user enables it manually
- Default settings: analysis type Content Summary, model Gemini 2.0 Flash
- Gemini quota limits are handled by automatic fallback models
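An automatic fallback between models on quota errors could look roughly like the sketch below. Everything here is hypothetical: QuotaExceededError stands in for whatever error the real client wrapper raises, call_model is a caller-supplied function, and the README does not say which fallback models the platform actually uses.

```python
from typing import Callable, Optional, Sequence, Tuple


class QuotaExceededError(Exception):
    """Hypothetical error raised when a model's quota is exhausted."""


def generate_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, response_text)."""
    last_error: Optional[QuotaExceededError] = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except QuotaExceededError as exc:
            last_error = exc  # quota hit: move on to the next model
    raise RuntimeError("All models are over quota") from last_error
```

Only quota errors trigger the fallback; any other exception propagates immediately so that real failures are not masked.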
📚 API Reference
Core Processing Pipeline
from core.modules import StreamlineFileHandler, HFConversionEngine
from llm.gemini_connector import GeminiAnalysisEngine
# Initialize components
handler = StreamlineFileHandler(resource_manager)
engine = HFConversionEngine(resource_manager, config)
gemini = GeminiAnalysisEngine(gemini_config)
# Process document
file_result = await handler.process_upload(file_obj)
conversion_result = await engine.convert_stream(file_content, metadata)
analysis_result = await gemini.analyze_content(analysis_request)
Visualization Generation
from visualization.analytics_engine import InteractiveVisualizationEngine
viz_engine = InteractiveVisualizationEngine()
dashboard = viz_engine.create_quality_dashboard(conversion_result, analysis_result)
structure_viz = viz_engine.create_structural_analysis_viz(conversion_result)
🔐 Security & Privacy
Data Handling
- No Persistent Storage: All processing in memory with automatic cleanup
- API Key Security: Keys stored locally, never transmitted to servers
- File Privacy: Temporary files automatically deleted after processing
- Error Logging: Sanitized logs without sensitive information
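Two of the points above, automatic temp-file deletion and sanitized logs, can be sketched with the standard library. The helper names are hypothetical, and the key pattern is an assumption (Google API keys commonly begin with "AIza"); the platform's own cleanup and logging code may differ.

```python
import os
import re
import tempfile
from contextlib import contextmanager
from typing import Iterator

# Assumed shape of a Google API key; adjust if your keys look different.
_API_KEY_PATTERN = re.compile(r"AIza[0-9A-Za-z_\-]{35}")


def sanitize(message: str) -> str:
    """Replace anything that looks like an API key before it reaches a log."""
    return _API_KEY_PATTERN.sub("[REDACTED]", message)


@contextmanager
def temporary_upload(data: bytes, suffix: str = "") -> Iterator[str]:
    """Write data to a temp file, yield its path, and delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        yield path
    finally:
        # Runs on success and on error, so no upload outlives its processing.
        if os.path.exists(path):
            os.remove(path)
```

Wrapping the conversion call in "with temporary_upload(...) as path:" ties the file's lifetime to the processing step itself.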
Compliance Features
- GDPR Ready: No personal data retention
- Enterprise Security: Secure API integrations
- Audit Trail: Comprehensive processing logs
- Access Control: Environment-based configuration
🤝 Contributing
Development Guidelines
- Code Style: Follow PEP 8 with Black formatting
- Testing: Comprehensive unit and integration tests
- Documentation: Detailed docstrings and README updates
- Performance: Memory-efficient and HF Spaces optimized
Pull Request Process
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Microsoft MarkItDown: Core document conversion capabilities
- Google Gemini: Advanced AI analysis features
- Hugging Face: Platform hosting and community support
- Plotly: Interactive visualization framework
- Gradio: User interface framework
📞 Support
Getting Help
- Documentation: Comprehensive guides and examples
- Issues: GitHub Issues
- Discussions: Community Forum
- Email: [email protected]
Frequently Asked Questions
Q: What's the maximum file size? A: 50MB for HF Spaces free tier. Larger files can be processed in local deployments.
Q: Do I need a Gemini API key? A: No. Basic conversion works without an API key; a Gemini key enables the AI analysis features.
Q: Can I process multiple files at once? A: The current version supports single-file processing. Batch processing is available in advanced analytics.
Q: How accurate are the quality scores? A: Scores are based on structural analysis and AI evaluation. Use as guidelines for optimization.
Built with ❤️ for enterprise document processing
Last updated: September 2025