metadata
title: MarkItDownTestingPlatform
emoji: 📊
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
short_description: Enterprise-Grade Document Conversion Testing with AI-Powered

🚀 MarkItDown Testing Platform

Enterprise-Grade Document Conversion Testing with AI-Powered Analysis

Hugging Face Spaces | Python 3.10+ | License: MIT

🎯 Overview

A testing platform for Microsoft's MarkItDown document conversion tool, extended with Google Gemini AI analysis. It is designed for enterprise-scale document processing workflows, with a focus on quality assessment and performance optimization.

✨ Key Features

  • 🔄 Multi-Format Support: PDF, DOCX, PPTX, XLSX, HTML, TXT, CSV, JSON, XML
  • 🤖 AI-Powered Analysis: Google Gemini integration for quality assessment
  • 📊 Interactive Dashboards: Real-time visualization of conversion metrics
  • 🏢 Enterprise-Ready: Scalable architecture with comprehensive error handling
  • 💾 Export Capabilities: Multiple output formats for integration workflows
  • 📈 Performance Monitoring: Detailed analytics and optimization insights

🚀 Quick Start

Using the Hugging Face Space

  1. Visit the Space: MarkItDown Testing Platform
  2. Upload Document: Drag & drop or select your document
  3. Configure Analysis: Enter Gemini API key for AI analysis (optional)
  4. Process: Click "Process Document" and review results
  5. Export: Download results in your preferred format

Getting Gemini API Key

  1. Visit Google AI Studio
  2. Create a new API key
  3. Copy and paste into the application
  4. Enjoy AI-powered document analysis!

📋 Supported File Formats

Category    | Formats               | Notes
Documents   | PDF, DOCX, PPTX, XLSX | Full structure preservation
Web Content | HTML, HTM             | Complete formatting retention
Text Files  | TXT, CSV, JSON, XML   | Enhanced parsing capabilities
Rich Text   | RTF                   | Advanced formatting support
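
The table above can be expressed as a small lookup for checking uploads before conversion. This is a minimal sketch rather than the platform's actual code: `FORMAT_CATEGORIES` and `categorize` are hypothetical names, and the only extensions assumed are the ones listed in the table.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup mirroring the supported-formats table.
FORMAT_CATEGORIES = {
    "Documents": {".pdf", ".docx", ".pptx", ".xlsx"},
    "Web Content": {".html", ".htm"},
    "Text Files": {".txt", ".csv", ".json", ".xml"},
    "Rich Text": {".rtf"},
}


def categorize(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive extension match
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None
```

Matching on the lowercased suffix keeps `report.PDF` and `report.pdf` in the same category; a file with an unlisted extension yields `None` so the caller can reject it early.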

🏗️ Architecture Overview

┌─────────────────────────────────────────┐
│           Gradio Interface              │
├─────────────────────────────────────────┤
│ File Upload │ Config │ Analysis │ Export│
├─────────────────────────────────────────┤
│         Processing Pipeline             │
├─────────────────────────────────────────┤
│ MarkItDown │ Gemini AI │ Visualization  │
├─────────────────────────────────────────┤
│          Analytics & Reporting          │
└─────────────────────────────────────────┘

Core Components

  • core/modules.py: Stateless processing engine optimized for HF Spaces
  • llm/gemini_connector.py: Enterprise Gemini API integration
  • visualization/analytics_engine.py: Interactive dashboard generation
  • app.py: Main Gradio application orchestration

🔧 Technical Specifications

System Requirements

  • Python: 3.10+
  • Memory: Optimized for HF Spaces (16GB limit)
  • Storage: Stateless design with temporary file handling
  • Processing: Async pipeline with resource management

Key Dependencies

gradio>=4.44.0                  # Gradio interface (HF Spaces compatible)
markitdown[all]>=0.1.0          # Microsoft conversion engine
google-genai>=1.0.0             # Gemini integration (new client)
plotly>=5.17.0                  # Interactive visualizations
pandas>=1.5.0                   # Data processing

📊 Analysis Capabilities

Quality Metrics

  • Structure Score: Heading, list, table preservation (0-10)
  • Completeness Score: Information retention assessment (0-10)
  • Accuracy Score: Formatting correctness evaluation (0-10)
  • Readability Score: AI-friendly output optimization (0-10)
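
To illustrate how the four 0-10 scores might be carried through a pipeline, the sketch below validates each score and computes an unweighted mean. `QualityScores` and `overall` are hypothetical names, and the unweighted mean is an assumption: the README does not state how (or whether) the platform aggregates the scores.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QualityScores:
    """Hypothetical container for the four quality metrics (each 0-10)."""

    structure: float
    completeness: float
    accuracy: float
    readability: float

    def __post_init__(self) -> None:
        # Reject any score outside the documented 0-10 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 10:
                raise ValueError(f"{f.name} must be in [0, 10], got {value}")

    def overall(self) -> float:
        """Unweighted mean of the four scores (an assumed aggregation)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)
```

Validating at construction time means a malformed AI response (for example, a score of 11) fails loudly instead of skewing a dashboard.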

AI Analysis Types

  • Quality Analysis: Comprehensive conversion assessment
  • Structure Review: Document hierarchy and organization
  • Content Summary: Thematic analysis and key insights
  • Extraction Quality: Data preservation evaluation

Visualization Features

  • Quality Dashboard: Multi-metric radar and performance charts
  • Structure Analysis: Hierarchical document mapping
  • Comparison Tools: Multi-document analysis capabilities
  • Performance Timeline: Processing optimization insights

🎯 Use Cases

Enterprise Document Migration

  • Legacy System Modernization: Convert historical documents to modern formats
  • Content Management: Standardize document formats across organizations
  • Compliance Documentation: Ensure consistent formatting for regulatory requirements

AI/ML Pipeline Integration

  • RAG System Preparation: Optimize documents for retrieval systems
  • Training Data Processing: Convert diverse formats for model training
  • Content Analysis: Extract structured data from unstructured documents

Quality Assurance

  • Conversion Validation: Verify accuracy of automated processing
  • Performance Benchmarking: Compare different conversion approaches
  • Error Detection: Identify and resolve processing issues

📈 Performance Optimization

HF Spaces Optimizations

  • Memory Management: Automatic cleanup and resource monitoring
  • Processing Limits: Smart file size and timeout management
  • Async Processing: Non-blocking operations for better UX
  • Error Recovery: Graceful degradation and retry mechanisms
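
The timeout and retry behaviour listed above could look roughly like the following. This is a sketch under stated assumptions, not the platform's implementation: `run_with_retry` and its parameters are hypothetical, and treating only timeouts and `ConnectionError` as retryable is an illustrative choice.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def run_with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    timeout: float = 300.0,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run `operation` with a per-attempt timeout and exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            # wait_for cancels the operation if it exceeds the timeout.
            return await asyncio.wait_for(operation(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

Passing a zero-argument callable rather than a coroutine object matters here: a coroutine can only be awaited once, so each retry needs a fresh one.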

Best Practices

  • File Preparation: Use high-quality source documents
  • API Management: Monitor Gemini API usage and limits
  • Result Analysis: Review quality metrics for optimization opportunities
  • Export Strategy: Choose appropriate formats for downstream processing

🛠️ Development Setup

Local Development

# Clone repository
git clone https://github.com/your-username/markitdown-testing-platform
cd markitdown-testing-platform

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

# Run application
python app.py

Environment Variables

# Optional: Set custom configurations
export GRADIO_TEMP_DIR="/tmp"
export MAX_FILE_SIZE="52428800"  # 50MB in bytes
export PROCESSING_TIMEOUT="300"  # 5 minutes
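
On the Python side, these variables might be read once at startup with the same defaults as the shell example. The sketch below is an assumption about how that could be done: `RuntimeLimits`, `load_limits`, and `check_upload_size` are hypothetical names, not functions from this repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RuntimeLimits:
    max_file_size: int       # bytes
    processing_timeout: int  # seconds
    temp_dir: str


def load_limits(env: Mapping[str, str] = os.environ) -> RuntimeLimits:
    """Read the optional settings, falling back to the example's values."""
    return RuntimeLimits(
        max_file_size=int(env.get("MAX_FILE_SIZE", 50 * 1024 * 1024)),
        processing_timeout=int(env.get("PROCESSING_TIMEOUT", 300)),
        temp_dir=env.get("GRADIO_TEMP_DIR", "/tmp"),
    )


def check_upload_size(size_bytes: int, limits: RuntimeLimits) -> None:
    """Raise ValueError when an upload exceeds the configured limit."""
    if size_bytes > limits.max_file_size:
        raise ValueError(
            f"file is {size_bytes} bytes; limit is {limits.max_file_size} bytes"
        )
```

Accepting the environment as a mapping keeps the loader easy to exercise with a plain dictionary, without touching the real process environment.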

Deploying to Hugging Face Spaces

  1. Create the Space

    • Open huggingface.co/spaces/new
    • Choose the Gradio SDK, the name DocSA/MarkItDownTestingPlatform, and the Python 3.11 runtime
    • app_file must remain app.py
  2. Push the code

    git remote add hf https://huggingface.co/spaces/DocSA/MarkItDownTestingPlatform
    git push hf main
    
  3. Configure secrets and environment variables

    • Add the GEMINI_API_KEY secret (Settings → Repository secrets → Add)
    • Additional (non-secret) variables: MAX_FILE_SIZE_MB=50, PROCESSING_TIMEOUT=300, APP_VERSION=2.0.0-enterprise
  4. Runtime notes

    • Gemini analysis is disabled by default; the user enables it manually
    • Default settings: analysis type Content Summary, model Gemini 2.0 Flash
    • Gemini quota limits are handled by automatic fallback models
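
The quota fallback mentioned in the last runtime note can be pictured as trying an ordered list of models until one succeeds. The following is only a sketch of that idea: `QuotaExceededError`, `analyze_with_fallback`, and the `call_model` callback are hypothetical stand-ins, and the README does not say which fallback models the platform actually uses.

```python
from typing import Callable, Optional, Sequence, Tuple


class QuotaExceededError(Exception):
    """Hypothetical error standing in for a quota or rate-limit response."""


def analyze_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, response_text)."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except QuotaExceededError as exc:
            last_error = exc  # quota hit: move on to the next model
    raise RuntimeError("all models exhausted their quota") from last_error
```

Returning the name of the model that answered lets the caller record which fallback was used, which is useful when comparing analysis quality across runs.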

📚 API Reference

Core Processing Pipeline

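from core.modules import StreamlineFileHandler, HFConversionEngine
from llm.gemini_connector import GeminiAnalysisEngine


async def process_document(file_obj, file_content, metadata, analysis_request,
                           resource_manager, config, gemini_config):
    # Illustrative wrapper: `await` is only valid inside a coroutine, and the
    # resource manager, configs, and request objects are supplied by the caller.

    # Initialize components
    handler = StreamlineFileHandler(resource_manager)
    engine = HFConversionEngine(resource_manager, config)
    gemini = GeminiAnalysisEngine(gemini_config)

    # Process document
    file_result = await handler.process_upload(file_obj)
    conversion_result = await engine.convert_stream(file_content, metadata)
    analysis_result = await gemini.analyze_content(analysis_request)
    return file_result, conversion_result, analysis_result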

Visualization Generation

from visualization.analytics_engine import InteractiveVisualizationEngine

viz_engine = InteractiveVisualizationEngine()
dashboard = viz_engine.create_quality_dashboard(conversion_result, analysis_result)
structure_viz = viz_engine.create_structural_analysis_viz(conversion_result)

🔐 Security & Privacy

Data Handling

  • No Persistent Storage: All processing in memory with automatic cleanup
  • API Key Security: Keys are not persisted by the application; they are used only to authenticate requests to the Gemini API
  • File Privacy: Temporary files automatically deleted after processing
  • Error Logging: Sanitized logs without sensitive information
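
Two of the points above, automatic temp-file deletion and sanitized logs, can be sketched with the standard library alone. This is an assumption about one reasonable approach, not the platform's code: `sanitize`, `process_in_temp_dir`, and the redaction pattern are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import Tuple

# Assumed pattern: redact whatever follows "api_key=", "api-key:", "apikey=" etc.
_SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key\s*[=:]\s*)\S+")


def sanitize(message: str) -> str:
    """Replace key-like values in a log message with a placeholder."""
    return _SECRET_PATTERN.sub(r"\1[REDACTED]", message)


def process_in_temp_dir(data: bytes, filename: str) -> Tuple[int, Path]:
    """Write an upload to a temporary directory that is removed on exit."""
    with tempfile.TemporaryDirectory() as tmp:
        # Keep only the base name so "../x" cannot escape the directory.
        path = Path(tmp) / Path(filename).name
        path.write_bytes(data)
        size = path.stat().st_size  # placeholder for the real conversion step
    # The directory and its contents no longer exist at this point.
    return size, path
```

Using the context manager ties cleanup to scope exit, so the file is removed even if the processing step raises.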

Compliance Features

  • GDPR Ready: No personal data retention
  • Enterprise Security: Secure API integrations
  • Audit Trail: Comprehensive processing logs
  • Access Control: Environment-based configuration

🤝 Contributing

Development Guidelines

  1. Code Style: Follow PEP 8 with Black formatting
  2. Testing: Comprehensive unit and integration tests
  3. Documentation: Detailed docstrings and README updates
  4. Performance: Memory-efficient and HF Spaces optimized

Pull Request Process

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Microsoft MarkItDown: Core document conversion capabilities
  • Google Gemini: Advanced AI analysis features
  • Hugging Face: Platform hosting and community support
  • Plotly: Interactive visualization framework
  • Gradio: User interface framework

📞 Support

Getting Help

Frequently Asked Questions

Q: What's the maximum file size?
A: 50MB on the HF Spaces free tier. Larger files can be processed in local deployments.

Q: Do I need a Gemini API key?
A: No. Basic conversion works without an API key; a Gemini key enables the AI analysis features.

Q: Can I process multiple files at once?
A: The current version supports single-file processing. Batch processing is available in advanced analytics.

Q: How accurate are the quality scores?
A: Scores are based on structural analysis and AI evaluation. Treat them as guidelines for optimization rather than exact measurements.


Built with ❤️ for enterprise document processing

Last updated: September 2025