Commit 8c0cf48 by DocUA · 0 Parent(s)

Initial clean commit: purge history (remove .env and venv from history)

.DS_Store ADDED
Binary file (6.15 kB).
 
.gitignore ADDED
@@ -0,0 +1,189 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # PyInstaller
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ .hypothesis/
+ .pytest_cache/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # pyenv
+ .python-version
+
+ # celery beat schedule file
+ celerybeat-schedule
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # VS Code
+ .vscode/*
+ !.vscode/settings.json
+ !.vscode/tasks.json
+ !.vscode/launch.json
+ !.vscode/extensions.json
+
+ # PyCharm
+ .idea/
+ *.iml
+ *.ipr
+ *.iws
+
+ # macOS
+ .DS_Store
+ .AppleDouble
+ .LSOverride
+
+ # Windows
+ Thumbs.db
+ ehthumbs.db
+ Desktop.ini
+ $RECYCLE.BIN/
+
+ # Logs and databases
+ *.log
+ *.sql
+ *.sqlite
+
+ # Local development settings
+ .env.local
+ .env.development.local
+ .env.test.local
+ .env.production.local
+
+ # Local configuration
+ config/local/
+
+ # Temporary files
+ *.swp
+ *.swo
+ *~
+
+ # Project specific
+ instance/
+ .webassets-cache
+ .pytest_cache/
+ .coverage
+ htmlcov/
+
+ # Project dependencies
+ node_modules/
+
+ # Build files
+ build/
+ dist/
+ *.egg-info/
+
+ # Virtual Environment
+ venv/
+ env/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+ */.ipynb_checkpoints/*
+
+ # VS Code
+ .vscode/
+ !.vscode/settings.json
+ !.vscode/tasks.json
+ !.vscode/launch.json
+ !.vscode/extensions.json
+
+ # IDE specific files
+ .idea/
+ *.iml
+ *.ipr
+ *.iws
+
+ # System Files
+ .DS_Store
+ Thumbs.db
AGENTS.md ADDED
@@ -0,0 +1,534 @@
+ # Strategic Architectural Revision: Hugging Face Optimized MarkItDown Platform
+
+ ## Core Design Philosophy Adaptation
+
+ **"Simplicity scales better than sophistication on shared infrastructure"**
+
+ ### Revised Architectural Principles for HF Deployment:
+ - **Stateless by Design**: Zero persistence complexity for shared hosting
+ - **Memory-Efficient Processing**: Optimized for HF Spaces resource constraints
+ - **Cloud-Native Integration**: Seamless Gemini API integration patterns
+ - **Progressive Feature Disclosure**: Core functionality first, advanced features as additive layers
+
+ ## Phase 1: Simplified System Architecture
+
+ ### 🏗️ **HF-Optimized Architecture Overview**
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │                   GRADIO INTERFACE LAYER                    │
+ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
+ │ │ Upload  │ │ Process │ │ Analyze │ │ Compare │ │ Export  │ │
+ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
+ ├─────────────────────────────────────────────────────────────┤
+ │                 STATELESS PROCESSING LAYER                  │
+ │ ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
+ │ │ File Handler│  │ Conversion  │  │ LLM Gateway │           │
+ │ │   Module    │  │   Engine    │  │  (Gemini)   │           │
+ │ └─────────────┘  └─────────────┘  └─────────────┘           │
+ ├─────────────────────────────────────────────────────────────┤
+ │                 IN-MEMORY STATE MANAGEMENT                  │
+ │      Session Variables + Gradio State + Temp Storage        │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ ### 🔧 **Simplified Core Modules**
+
+ #### **1. Stateless File Handler**
+ ```python
+ import os
+
+ class StreamlineFileHandler:
+     """Memory-efficient, HF-optimized file processing"""
+
+     @staticmethod
+     def process_upload(file_obj):
+         """Direct stream processing without disk persistence"""
+         return {
+             'content': file_obj.read(),
+             'metadata': extract_minimal_metadata(file_obj),
+             'format': detect_format(file_obj.name)
+         }
+
+     @staticmethod
+     def validate_constraints(file_obj):
+         """HF Spaces resource-aware validation"""
+         # Max file size: 50MB for the free tier
+         # Supported formats: PDF, DOCX, PPTX, TXT, HTML
+         size_mb = file_obj.seek(0, os.SEEK_END) / (1024 * 1024)
+         file_obj.seek(0)  # rewind so process_upload() sees the full stream
+         ext = os.path.splitext(file_obj.name)[1].lower()
+         if size_mb > 50 or ext not in {'.pdf', '.docx', '.pptx', '.txt', '.html'}:
+             raise ValueError(f"Unsupported upload: {file_obj.name} ({size_mb:.1f} MB)")
+ ```
+
+ #### **2. Conversion Engine Adapter**
+ ```python
+ class HFConversionEngine:
+     """MarkItDown wrapper optimized for stateless execution"""
+
+     def __init__(self):
+         self.md = MarkItDown()
+         self.temp_cleanup_queue = []
+
+     async def convert_stream(self, file_data, config=None):
+         """Stream-based conversion with automatic cleanup"""
+         try:
+             # Process in memory where possible
+             result = await self._process_with_cleanup(file_data)
+             return self._format_response(result)
+         finally:
+             self._cleanup_temp_files()
+ ```
+
+ #### **3. Gemini LLM Gateway**
+ ```python
+ class GeminiConnector:
+     """Streamlined Gemini API integration"""
+
+     def __init__(self, api_key=None):
+         self.client = self._init_gemini_client(api_key)
+         self.models = {
+             'analysis': 'gemini-1.5-pro',
+             'summary': 'gemini-1.5-flash',
+             'vision': 'gemini-1.5-pro'  # 1.5 Pro is natively multimodal; there is no separate 1.5 vision model
+         }
+
+     async def analyze_content(self, markdown_content, task_type='analysis'):
+         """Unified Gemini analysis interface"""
+         prompt = self._build_analysis_prompt(markdown_content, task_type)
+         response = await self.client.generate_content(
+             model=self.models[task_type],
+             contents=prompt
+         )
+         return self._parse_gemini_response(response)
+ ```
+
+ ## Phase 2: Gradio Interface Strategy
+
+ ### 📱 **HF Spaces Optimized UI Design**
+
+ #### **Single-Page Progressive Enhancement:**
+
+ ```python
+ def create_markitdown_interface():
+     """Main interface factory with progressive complexity"""
+
+     with gr.Blocks(
+         title="MarkItDown Testing Platform",
+         theme=gr.themes.Soft(),
+         css=custom_hf_styles
+     ) as interface:
+
+         # State management for the stateless environment
+         session_state = gr.State({})
+         conversion_results = gr.State({})
+
+         with gr.Row():
+             with gr.Column(scale=1):
+                 # LEFT: Input & Configuration
+                 file_upload = gr.File(
+                     label="Upload Document",
+                     file_types=['.pdf', '.docx', '.pptx', '.txt', '.html'],
+                     type="binary"
+                 )
+
+                 # Gemini Configuration
+                 with gr.Accordion("🔧 LLM Configuration", open=False):
+                     gemini_key = gr.Textbox(
+                         label="Gemini API Key",
+                         type="password",
+                         placeholder="Enter your Gemini API key..."
+                     )
+                     analysis_type = gr.Dropdown(
+                         choices=['Quality Analysis', 'Structure Review', 'Content Summary'],
+                         value='Quality Analysis',
+                         label="Analysis Type"
+                     )
+
+                 process_btn = gr.Button(
+                     "🚀 Process Document",
+                     variant="primary",
+                     size="lg"
+                 )
+
+             with gr.Column(scale=2):
+                 # RIGHT: Results & Analysis
+                 with gr.Tabs() as results_tabs:
+
+                     with gr.TabItem("📄 Conversion Results"):
+                         conversion_status = gr.HTML()
+
+                         with gr.Row():
+                             with gr.Column():
+                                 gr.Markdown("### Original Preview")
+                                 original_preview = gr.HTML()
+
+                             with gr.Column():
+                                 gr.Markdown("### Markdown Output")
+                                 markdown_output = gr.Code(
+                                     language="markdown",
+                                     show_label=False
+                                 )
+
+                     with gr.TabItem("🤖 LLM Analysis"):
+                         analysis_status = gr.HTML()
+                         llm_analysis = gr.Markdown()
+
+                         # Analysis metrics visualization
+                         metrics_plot = gr.Plot()
+
+                     with gr.TabItem("📊 Comparison Dashboard"):
+                         quality_metrics = gr.JSON(label="Quality Metrics")
+
+                         # Interactive comparison
+                         comparison_viz = gr.HTML()
+
+                     with gr.TabItem("💾 Export Options"):
+                         export_format = gr.Dropdown(
+                             choices=['Markdown (.md)', 'HTML (.html)', 'JSON Report (.json)'],
+                             value='Markdown (.md)',
+                             label="Export Format"
+                         )
+
+                         export_btn = gr.Button("📥 Download Results")
+                         download_file = gr.File(visible=False)
+
+         # Event handlers with HF optimization
+         process_btn.click(
+             fn=process_document_pipeline,
+             inputs=[file_upload, gemini_key, analysis_type, session_state],
+             outputs=[conversion_status, markdown_output, original_preview, conversion_results],
+             show_progress="full"
+         )
+
+     return interface
+ ```
+
+ ### 🔄 **Stateless Processing Pipeline**
+
+ ```python
+ async def process_document_pipeline(file_obj, gemini_key, analysis_type, session_state):
+     """Main processing pipeline optimized for HF Spaces"""
+
+     pipeline_state = {
+         'timestamp': datetime.now().isoformat(),
+         'file_info': {},
+         'conversion_result': {},
+         'analysis_result': {},
+         'metrics': {}
+     }
+
+     try:
+         # Stage 1: File Processing
+         yield gr.HTML("🔄 Processing uploaded file..."), "", "", pipeline_state
+
+         file_handler = StreamlineFileHandler()
+         file_data = file_handler.process_upload(file_obj)
+         pipeline_state['file_info'] = file_data['metadata']
+
+         # Stage 2: MarkItDown Conversion
+         yield gr.HTML("🔄 Converting to Markdown..."), "", "", pipeline_state
+
+         converter = HFConversionEngine()
+         conversion_result = await converter.convert_stream(file_data)
+         pipeline_state['conversion_result'] = conversion_result
+
+         # Stage 3: Gemini Analysis (only if an API key is provided)
+         if gemini_key and gemini_key.strip():
+             yield gr.HTML("🤖 Analyzing with Gemini..."), conversion_result['markdown'], "", pipeline_state
+
+             gemini = GeminiConnector(gemini_key)
+             analysis = await gemini.analyze_content(
+                 conversion_result['markdown'],
+                 analysis_type.lower().replace(' ', '_')
+             )
+             pipeline_state['analysis_result'] = analysis
+
+         # Stage 4: Generate Visualization Metrics
+         metrics = generate_quality_metrics(pipeline_state)
+         pipeline_state['metrics'] = metrics
+
+         # Final Results
+         yield (
+             gr.HTML("✅ Processing complete!"),
+             conversion_result['markdown'],
+             generate_original_preview(file_data),
+             pipeline_state
+         )
+
+     except Exception as e:
+         yield (
+             gr.HTML(f"❌ Error: {str(e)}"),
+             "",
+             "",
+             pipeline_state
+         )
+ ```
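+
+ The pipeline calls `generate_quality_metrics`, which is not defined in this document. A minimal sketch of what it might derive from the pipeline state is shown below; the keys and heuristics are illustrative, not the platform's actual scoring logic:
+
+ ```python
+ def generate_quality_metrics(pipeline_state):
+     """Illustrative metric derivation from the accumulated pipeline state."""
+     markdown = pipeline_state['conversion_result'].get('markdown', '')
+     lines = markdown.splitlines()
+     return {
+         'output_chars': len(markdown),
+         'heading_count': sum(1 for l in lines if l.lstrip().startswith('#')),
+         'table_rows': sum(1 for l in lines if l.lstrip().startswith('|')),
+         'has_analysis': bool(pipeline_state['analysis_result']),
+     }
+ ```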
+
+ ## Phase 3: Gemini Integration Strategy
+
+ ### 🧠 **Multi-Model Gemini Architecture**
+
+ ```python
+ import asyncio
+
+ class GeminiAnalysisEngine:
+     """Sophisticated Gemini-powered analysis system"""
+
+     ANALYSIS_PROMPTS = {
+         'quality_analysis': """
+             Analyze the quality of this Markdown conversion from a document.
+
+             Focus on:
+             1. Structure preservation (headers, lists, tables)
+             2. Content completeness
+             3. Formatting accuracy
+             4. Information hierarchy
+
+             Provide a structured analysis with scores (1-10) and recommendations.
+         """,
+
+         'structure_review': """
+             Review the structural elements of this converted Markdown document.
+
+             Identify:
+             1. Document hierarchy (H1, H2, H3, etc.)
+             2. Lists and their nesting
+             3. Tables and their formatting
+             4. Code blocks and special formatting
+
+             Create a structural map and quality assessment.
+         """,
+
+         'content_summary': """
+             Create a comprehensive summary of this document's content.
+
+             Include:
+             1. Main topics and themes
+             2. Key information points
+             3. Document purpose and audience
+             4. Content organization assessment
+
+             Provide both a brief summary and detailed breakdown.
+         """
+     }
+
+     async def comprehensive_analysis(self, markdown_content, analysis_types=('quality_analysis',)):
+         """Execute multiple analysis types concurrently"""
+
+         # Tuple default avoids the mutable-default-argument pitfall
+         tasks = [
+             self._single_analysis(markdown_content, analysis_type)
+             for analysis_type in analysis_types
+         ]
+
+         results = await asyncio.gather(*tasks, return_exceptions=True)
+
+         return {
+             'analyses': dict(zip(analysis_types, results)),
+             'combined_score': self._calculate_combined_score(results),
+             'recommendations': self._generate_recommendations(results)
+         }
+ ```
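+
+ A hedged usage sketch, assuming `_single_analysis` and the scoring helpers are implemented elsewhere in the codebase and that the engine can be constructed without arguments:
+
+ ```python
+ # Inside an async handler:
+ engine = GeminiAnalysisEngine()
+ report = await engine.comprehensive_analysis(
+     markdown_text,
+     analysis_types=('quality_analysis', 'structure_review')
+ )
+ print(report['combined_score'], report['recommendations'])
+ ```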
+
+ ### 📊 **HF-Optimized Visualization Components**
+
+ ```python
+ import plotly.graph_objects as go
+
+ def create_analysis_visualization(analysis_results):
+     """Generate interactive visualizations for HF Spaces"""
+
+     # Quality Score Radar Chart
+     def quality_radar_chart(scores):
+         categories = ['Structure', 'Completeness', 'Accuracy', 'Readability']
+
+         fig = go.Figure()
+         fig.add_trace(go.Scatterpolar(
+             r=list(scores.values()),
+             theta=categories,
+             fill='toself',
+             name='Quality Metrics'
+         ))
+
+         fig.update_layout(
+             polar=dict(
+                 radialaxis=dict(
+                     visible=True,
+                     range=[0, 10]
+                 )),
+             showlegend=False,
+             title="Document Conversion Quality"
+         )
+
+         return fig
+
+     # Content Structure Tree
+     def structure_tree_viz(structure_data):
+         """Hierarchical document structure visualization"""
+         # Implementation for interactive document structure
+         pass
+
+     return {
+         'quality_chart': quality_radar_chart(analysis_results.get('scores', {})),
+         'structure_viz': structure_tree_viz(analysis_results.get('structure', {}))
+     }
+ ```
+
+ ## Phase 4: HF Deployment Optimization
+
+ ### 🚀 **Hugging Face Spaces Configuration**
+
+ #### **requirements.txt (Optimized)**
+ ```txt
+ gradio>=4.0.0
+ markitdown[all]>=0.1.0
+ google-generativeai>=0.3.0
+ plotly>=5.0.0
+ python-multipart>=0.0.6
+ aiofiles>=22.0.0
+ Pillow>=9.0.0
+
+ # Lightweight alternatives for HF
+ pandas>=1.3.0
+ numpy>=1.21.0
+ ```
+
+ #### **app.py (Entry Point)**
+ ```python
+ import os
+
+ import gradio as gr
+
+ from markitdown_platform import create_markitdown_interface
+
+ # HF Spaces environment configuration
+ def setup_hf_environment():
+     """Configure environment for HF Spaces deployment"""
+
+     # Set memory limits
+     os.environ['GRADIO_TEMP_DIR'] = '/tmp'
+     os.environ['MAX_FILE_SIZE'] = '50MB'  # HF free-tier limit
+
+     # Optimize for HF infrastructure
+     gr.set_static_paths(paths=["./assets/"])
+
+ def main():
+     """Main application entry point"""
+
+     setup_hf_environment()
+
+     # Create the optimized interface
+     interface = create_markitdown_interface()
+
+     # Gradio 4 configures queueing on the Blocks object, not launch()
+     interface.queue(max_size=20)  # queue limit for the free tier
+
+     # HF Spaces optimized launch
+     interface.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False,  # HF handles sharing
+         show_error=True,
+         max_file_size="50mb",
+         allowed_paths=["./temp/"]
+     )
+
+ if __name__ == "__main__":
+     main()
+ ```
+
+ ### 🔧 **Resource Management Strategy**
+
+ #### **Memory-Efficient Processing**
+ ```python
+ class ResourceError(Exception):
+     """Raised when an operation would exceed HF Spaces limits."""
+
+ class HFResourceManager:
+     """Resource management for HF Spaces constraints"""
+
+     MAX_MEMORY_MB = 16 * 1024  # 16GB limit for HF Spaces
+     MAX_FILE_SIZE_MB = 50
+     MAX_CONCURRENT_PROCESSES = 3
+
+     @classmethod
+     def check_resource_constraints(cls, file_size_mb, current_memory_usage):
+         """Validate resource availability before processing"""
+
+         if file_size_mb > cls.MAX_FILE_SIZE_MB:
+             raise ResourceError(f"File size {file_size_mb}MB exceeds limit {cls.MAX_FILE_SIZE_MB}MB")
+
+         if current_memory_usage > cls.MAX_MEMORY_MB * 0.8:  # 80% threshold
+             raise ResourceError("Insufficient memory available")
+
+         return True
+
+     @staticmethod
+     def cleanup_temp_resources():
+         """Aggressive cleanup for memory management"""
+         import gc
+         import os
+         import shutil
+         import tempfile
+
+         # Force garbage collection
+         gc.collect()
+
+         # Clean temporary directories
+         temp_dir = tempfile.gettempdir()
+         for item in os.listdir(temp_dir):
+             if item.startswith('gradio_'):
+                 shutil.rmtree(os.path.join(temp_dir, item), ignore_errors=True)
+ ```
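+
+ How `current_memory_usage` gets measured is left open above; one common option, assuming `psutil` were added to requirements.txt (it currently is not), looks like this:
+
+ ```python
+ import psutil
+
+ # Resident set size of this process, in MB
+ current_mb = psutil.Process().memory_info().rss / (1024 * 1024)
+ HFResourceManager.check_resource_constraints(file_size_mb=12, current_memory_usage=current_mb)
+ ```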
+
+ ## Phase 5: Development Roadmap (HF-Optimized)
+
+ ### **Sprint 1: HF Foundation** (1 week)
+ - Stateless architecture implementation
+ - Basic Gradio interface with Gemini integration
+ - File upload with HF constraints validation
+ - Simple MarkItDown pipeline
+
+ ### **Sprint 2: Core Features** (1 week)
+ - Multi-model Gemini analysis integration
+ - Real-time processing with progress indicators
+ - Basic visualization dashboard
+ - Export functionality
+
+ ### **Sprint 3: Advanced Analysis** (1 week)
+ - Comprehensive quality metrics
+ - Interactive comparison tools
+ - Advanced visualization components
+ - Error handling and recovery
+
+ ### **Sprint 4: Polish & Optimization** (1 week)
+ - HF Spaces performance optimization
+ - UI/UX refinements
+ - Resource management improvements
+ - Documentation and examples
+
+ ## Success Metrics for HF Deployment
+
+ ### **Technical Performance:**
+ - Cold start time < 30 seconds
+ - Processing time < 2 minutes for 50MB files
+ - Memory usage < 12GB peak
+ - 99% uptime on HF infrastructure
+
+ ### **User Experience:**
+ - Intuitive single-page workflow
+ - Clear progress indication
+ - Responsive design for mobile
+ - Comprehensive error messaging
+
+ ### **Feature Adoption:**
+ - Gemini analysis utilization rate
+ - Export format preferences
+ - Average session duration
+ - User return rate
+
+ ---
+
+ **Immediate Next Steps:**
+
+ 1. **Environment Setup**: Create the HF Space and test a basic deployment
+ 2. **Gemini Integration**: Implement and test API connectivity
+ 3. **Core Pipeline**: Build the stateless processing architecture
+ 4. **UI Prototype**: Create a basic Gradio interface with progressive enhancement
+
+ **Key Architectural Decisions:**
+ - ✅ **Stateless Design**: Eliminates persistence complexity
+ - ✅ **Gemini Focus**: Single LLM provider for simplicity
+ - ✅ **HF Optimization**: Resource-aware processing
+ - ✅ **Progressive Enhancement**: Core features first, advanced features additive
+
+ This revised architecture prioritizes **deployment simplicity** while maintaining **functional richness**, which makes it a good fit for the HF Spaces environment with Gemini integration.
Dockerfile ADDED
@@ -0,0 +1,43 @@
+ FROM python:3.11-slim-bookworm
+
+ ARG DEBIAN_FRONTEND=noninteractive
+
+ ENV PYTHONUNBUFFERED=1 \
+     PYTHONDONTWRITEBYTECODE=1 \
+     PIP_NO_CACHE_DIR=1 \
+     PIP_DISABLE_PIP_VERSION_CHECK=1 \
+     GRADIO_SERVER_NAME="0.0.0.0" \
+     GRADIO_SERVER_PORT=7860 \
+     HF_HOME="/tmp" \
+     GRADIO_TEMP_DIR="/tmp"
+
+ WORKDIR /app
+
+ # Copy requirements separately to take advantage of layer caching
+ COPY requirements.txt .
+
+ # Security updates plus temporary build tools (removed after the install);
+ # only the runtime packages are kept: libmagic1, curl
+ RUN apt-get update \
+     && apt-get upgrade -y \
+     && apt-get install -y --no-install-recommends \
+         gcc g++ make \
+         libmagic1 curl \
+     && pip install --no-cache-dir -r requirements.txt \
+     && apt-get purge -y gcc g++ make \
+     && apt-get autoremove -y \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy the application
+ COPY . .
+
+ RUN mkdir -p /tmp && chmod 777 /tmp
+
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+     CMD curl -f http://localhost:7860/ || exit 1
+
+ RUN chmod +x /app/app.py
+
+ EXPOSE 7860
+
+ CMD ["python", "app.py"]
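+
+ # Illustrative local smoke test (the image tag below is ours, not defined by the repo):
+ #   docker build -t markitdown-platform .
+ #   docker run --rm -p 7860:7860 markitdown-platform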
INSTRUCTION.md ADDED
@@ -0,0 +1,593 @@
+ # User Guide: MarkItDown Testing Platform
+
+ ## A Strategic Guide to Operating the Enterprise System
+
+ **"Convert documents into structured data with enterprise confidence"**
+
+ ---
+
+ ## Core Platform Philosophy
+
+ ### Key Design Principles
+ - **Human-Centered Interface**: Minimize the user's cognitive load
+ - **Adaptive Architecture**: The system evolves along with your needs
+ - **Process Transparency**: Every conversion step is understandable and controllable
+ - **Enterprise Reliability**: Industrial stability with elegant design
+
+ ---
+
+ ## Section 1: Strategic Overview of Capabilities
+
+ ### 🎯 **Primary Use Cases**
+
+ #### Corporate Document Migration
+ - **Task**: Convert legacy formats into modern standards
+ - **Approach**: Automated processing with quality control
+ - **Result**: A standardized document workflow with AI analytics
+
+ #### Preparing Data for AI Systems
+ - **Task**: Optimize documents for RAG (Retrieval-Augmented Generation)
+ - **Approach**: Structured analysis with quality scoring
+ - **Result**: AI-ready content with effectiveness metrics
+
+ #### Conversion Quality Control
+ - **Task**: Validate the accuracy of automated conversion
+ - **Approach**: Comprehensive analytics with detailed metrics
+ - **Result**: Trust in the process, backed by an audit trail
+
+ ---
+
+ ## Section 2: Step-by-Step Operating Instructions
+
+ ### 🚀 **Stage 1: Initial Configuration**
+
+ #### Accessing the Platform
+ 1. **Open the Hugging Face Space**: [MarkItDown Testing Platform](https://huggingface.co/spaces/your-username/markitdown-testing-platform)
+ 2. **Check the system requirements**:
+    - A modern browser (Chrome, Firefox, Safari, Edge)
+    - A stable internet connection
+    - JavaScript enabled
+
+ #### Obtaining a Gemini API Key (Optional)
+ ```
+ Strategic Recommendation:
+ A Gemini API key unlocks powerful AI analysis capabilities,
+ but basic conversion works without any additional setup
+ ```
+
+ **Step-by-step Gemini setup:**
+ 1. Visit [Google AI Studio](https://makersuite.google.com/app/apikey)
+ 2. Create a new project or pick an existing one
+ 3. Generate an API key with the appropriate permissions
+ 4. Copy the key (it is stored locally and never sent to the server)
+
+ ### 🔧 **Stage 2: Uploading and Configuring a Document**
+
+ #### Supported File Formats
+ | Category | Formats | Processing Notes |
+ |-----------|---------|-------------------|
+ | **Office documents** | PDF, DOCX, PPTX, XLSX | Structure and formatting preserved |
+ | **Web content** | HTML, HTM | Full CSS style support |
+ | **Structured data** | CSV, JSON, XML | Intelligent parsing |
+ | **Text files** | TXT, RTF | Extended encoding handling |
+
+ #### Upload Process
+ 1. **Open the "📁 Document Processing" tab**
+ 2. **Upload a file**:
+    - Drag & drop into the upload area
+    - Or click "Select Document" to pick a file
+    - **Limit**: 50MB on Hugging Face Spaces
+
+ 3. **Configure the processing options**:
+    ```
+    🔧 Strategic Recommendations:
+    - Quality Analysis: Comprehensive assessment of conversion quality
+    - Structure Review: Focus on preserving the document hierarchy
+    - Content Summary: Thematic analysis and key insights
+    - Extraction Quality: Assessment of data preservation
+    ```
+
+ 4. **Choose an AI model**:
+    - **Gemini 1.5 Pro**: Highest analysis quality (recommended)
+    - **Gemini 1.5 Flash**: Faster processing for large volumes
+
+ ### ⚡ **Stage 3: Running the Processing**
+
+ #### Conversion Process
+ 1. **Click "🚀 Process Document"**
+ 2. **Monitor progress**:
+    - Real-time tracking of each stage
+    - Loading indicators for every phase
+    - Automatic status notifications
+
+ #### Processing Stages
+ ```
+ Architectural Approach to Transparency:
+ Each stage has clear boundaries of responsibility and control points
+ ```
+
+ **Phase 1: File Validation**
+ - Format and integrity checks
+ - Security and size analysis
+ - Metadata extraction
+
+ **Phase 2: Conversion to Markdown**
+ - Optimized MarkItDown processing
+ - Structure and formatting preservation
+ - Quality metric generation
+
+ **Phase 3: AI Analysis (when a key is provided)**
+ - Gemini-powered intelligent analysis
+ - Quality scoring and recommendations
+ - Structural and content insights
+
+ ---
+
+ ## Section 3: Interpreting the Results
+
+ ### 📊 **Understanding the Quality Metrics**
+
+ #### Composite Score (0-10 points)
+ ```
+ Strategic Interpretation of Scores:
+ - 8.0-10.0: Excellent quality, production-ready
+ - 6.0-7.9: Good quality, minor optimizations
+ - 4.0-5.9: Acceptable quality, needs improvement
+ - 0.0-3.9: Needs attention; review your settings
+ ```
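+
+ For scripting against exported JSON reports, the bands above translate directly into a small helper; the function name and labels here are illustrative, not part of the platform's API:
+
+ ```python
+ def score_verdict(composite: float) -> str:
+     """Map a 0-10 composite score to the verdict bands documented above."""
+     if composite >= 8.0:
+         return "excellent - production ready"
+     if composite >= 6.0:
+         return "good - minor optimizations"
+     if composite >= 4.0:
+         return "acceptable - needs improvement"
+     return "needs attention - review settings"
+ ```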
+
+ #### Detailed Score Components
+
+ **Structure Score**
+ - **What it measures**: Preservation of headings, lists, and tables
+ - **High values**: The document kept its logical hierarchy
+ - **Low values**: Structural organization was lost
+ - **Action**: Check that the source document has a clear structure
+
+ **Completeness Score**
+ - **What it measures**: Retention of information from the original
+ - **High values**: Minimal data loss
+ - **Low values**: Significant content loss
+ - **Action**: Consider alternative conversion settings
+
+ **Accuracy Score**
+ - **What it measures**: Correct transfer of formatting elements
+ - **High values**: Formatting matches the original
+ - **Low values**: Formatting is distorted or lost
+ - **Action**: Validate the critical formatting elements
+
+ **Readability Score (for AI)**
+ - **What it measures**: Optimization for AI consumption
+ - **High values**: Ideal for LLM processing
+ - **Low values**: Needs additional processing
+ - **Action**: Consider post-processing optimizations
+
+ ### 🤖 **AI Analysis Results**
+
+ #### Analysis Types and Their Applications
+
+ **Quality Analysis**
+ ```
+ Practical Applications:
+ - Validating automated conversion processes
+ - Quality control for corporate pipelines
+ - Assessing readiness for downstream processing
+ ```
+
+ **Structure Review**
+ ```
+ Business Value:
+ - Ensuring the document hierarchy is preserved
+ - Validating the navigation structure
+ - Optimizing for search systems
+ ```
+
+ **Content Summary**
+ ```
+ Strategic Insights:
+ - Understanding the document's thematic content
+ - Identifying key concepts
+ - Preparation for content management systems
+ ```
+
+ ---
+
+ ## Section 4: Visualization and Analytics
+
+ ### 📈 **Navigating the Dashboard**
+
+ #### The "📊 Analysis Dashboard" Tab
+
+ **Quality Overview**
+ - **Gauge Chart**: Composite score with visual indicators
+ - **Interpretation**: A quick read on conversion success
+ - **Usage**: Executive summary for stakeholders
+
+ **Detailed Breakdown**
+ - **Radar Chart**: Multi-dimensional view of the quality indicators
+ - **Application**: Identifying strengths and weaknesses
+ - **Optimization**: Focus on the lowest-scoring components
+
+ **Document Structure**
+ - **Treemap**: Hierarchical visualization of document elements
+ - **Bar Charts**: Distribution of structural components
+ - **Insights**: Understanding the organizational logic
+
+ #### Interactive Features
+ ```
+ Architectural Approach to UX:
+ Every visual element provides actionable insights
+ with the ability to drill down into details
+ ```
+
+ - **Hover Effects**: Detailed information on hover
+ - **Zoom Functionality**: Zooming in for finer detail
+ - **Export Options**: Saving visualizations in multiple formats
+
+ ---
+
+ ## Section 5: Export and Integration
+
+ ### 💾 **Strategies for Saving Results**
+
+ #### Export Formats and Their Uses
+
+ **Markdown (.md)**
+ ```
+ Strategic Uses:
+ - Integration with Git-based workflows
+ - Feeding into LLMs for further processing
+ - Documentation-as-Code processes
+ ```
+
+ **HTML Report (.html)**
+ ```
+ Business Value:
+ - Presentation for non-technical stakeholders
+ - Archiving with visual context
+ - Web-based sharing and collaboration
+ ```
+
+ **JSON Data (.json)**
+ ```
+ Technical Integration:
+ - API-based integration with downstream systems
+ - Metadata for automated pipelines
+ - Structured data for analytics platforms
+ ```
+
+ **Complete Package (.zip)**
+ ```
+ Enterprise Approach:
+ - Comprehensive backup with all artifacts
+ - Audit trail for compliance processes
+ - Self-contained delivery package
+ ```
+
+ #### Export Process
+ 1. **Go to "💾 Export & History"**
+ 2. **Choose a format** based on your downstream requirements
+ 3. **Configure the options**:
+    - Original Document Preview
+    - AI Analysis Results
+    - Quality Metrics
+    - Visualizations
+    - Processing Logs
+
+ 4. **Generate and download**:
+    - Click "📥 Generate Export"
+    - Wait for the completion notification
+    - Download via the browser
+
+ ---
+
+ ## Section 6: Advanced Usage
+
+ ### 🔍 **Advanced Analytics**
+
+ #### Comparative Analysis
+ ```
+ Strategic Approach to Batch Processing:
+ Compare conversion effectiveness across
+ different document types and settings
+ ```
+
+ **Workflow for Comparative Analysis**:
+ 1. Upload several documents via "🔍 Advanced Analytics"
+ 2. Choose the analytics options:
+    - Performance Timeline
+    - Quality Trends
+    - Batch Statistics
+    - Resource Usage
+
+ 3. Generate comparative reports with actionable insights
+
+ #### Performance Monitoring
+ - **Processing Speed Trends**: Monitoring processing speed over time
+ - **Quality Consistency**: Stability of the quality indicators
+ - **Resource Utilization**: Efficiency of resource usage
+ - **Error Pattern Analysis**: Identifying problematic scenarios
+
+ ### ⚙️ **System Status and Monitoring**
+
+ #### Health Check Dashboard
+ Operational excellence metrics are reported as JSON:
+ ```json
+ {
+     "system_health": "Healthy/Degraded/Unhealthy",
+     "processing_capacity": "Available/Limited/Exhausted",
+     "api_connectivity": "Connected/Intermittent/Offline",
+     "cache_efficiency": "Percentage hit rate"
+ }
+ ```
+
+ **Interpreting the Statuses**:
+ - **Healthy**: The system is operating optimally
+ - **Degraded**: Reduced performance, but still functional
+ - **Unhealthy**: Needs intervention or troubleshooting
+
+ ---
+
+ ## Section 7: Troubleshooting and Optimization
+
+ ### 🔧 **Common Scenarios and Solutions**
+
+ #### Conversion Problems
+
+ **Symptom**: Poor PDF conversion quality
+ ```
+ Diagnostic Approach:
+ 1. Check whether the PDF contains a text layer (not just images)
+ 2. Consider the Azure Document Intelligence integration
+ 3. Test with different density settings
+ ```
+
+ **Solutions**:
+ - Use OCR preprocessing for scan-based PDFs
+ - Configure an Azure endpoint for complex documents
+ - Split large PDFs into sections
+
+ **Symptom**: Processing timeouts
+ ```
+ Resource Management Strategy:
+ - HF Spaces has a 5-minute processing limit
+ - Files >20MB need special attention
+ - Concurrent processing can create bottlenecks
+ ```
+
+ **Solutions**:
+ - Split large documents into smaller parts
+ - Reduce processing time by disabling AI analysis while testing
+ - Use a local deployment for heavy workloads
+
+ #### API and Configuration
+
+ **Symptom**: Gemini API errors
+ ```
+ Authentication and Rate Limiting:
+ - Verify that the API key is valid
+ - Monitor usage limits in the Google Console
+ - Configure retry logic for intermittent failures
+ ```
+
+ **Solutions**:
+ - Regenerate the API key in Google AI Studio
+ - Check quotas and billing status
+ - Use different models to balance the load
+
+ ### 📈 **Performance Optimization**
+
+ #### Strategies for Large Volumes
+
+ **Batch Processing Approach** (a concrete chunking helper follows the sketch):
+ ```python
+ # Pseudo-code for an optimal batch strategy
+ documents = preprocess_and_prioritize(document_list)
+ for batch in chunk_documents(documents, optimal_size=5):
+     results = process_batch_with_monitoring(batch)
+     validate_and_store_results(results)
+ ```
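+
+ Of the helpers named in the pseudo-code, only the batching step is generic enough to pin down; a minimal stand-in for `chunk_documents` might look like this (the other helpers remain project-specific):
+
+ ```python
+ def chunk_documents(docs, optimal_size=5):
+     """Yield fixed-size batches from a list of documents."""
+     for i in range(0, len(docs), optimal_size):
+         yield docs[i:i + optimal_size]
+ ```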
+
+ **Resource Optimization**:
+ - Use Gemini Flash for faster processing
+ - Cache results for repeated processing
+ - Monitor system health between batch operations
+
+ ---
+
+ ## Section 8: Integration and Automation
+
+ ### 🔗 **Enterprise Integration Patterns**
+
+ #### API-based Integration
+ ```python
+ # Example of integration through programmatic access
+ async def integrate_with_existing_pipeline(document_path):
+     # Use the core components directly
+     from markitdown_platform import DocumentProcessingOrchestrator
+
+     orchestrator = DocumentProcessingOrchestrator(...)
+     request = ProcessingRequest.from_file(document_path)
+     result = await orchestrator.process_document(request)
+
+     return standardize_output_format(result)
+ ```
+
+ #### Workflow Automation
+ ```
+ Strategic Automation Framework:
+ 1. Document Ingestion (watch folders, S3 triggers, API endpoints)
+ 2. Quality Gates (automated validation based on metrics)
+ 3. Routing Logic (different pipelines based on document type)
+ 4. Notification Systems (Slack, email, webhooks for completion)
+ ```
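+
+ Step 2 of the framework, the quality gate, reduces to a small predicate over the exported metrics; a hedged sketch (the threshold and key name are ours, not the platform's schema):
+
+ ```python
+ def passes_quality_gate(metrics: dict, min_composite: float = 6.0) -> bool:
+     """Gate a document on its composite quality score before routing it onward."""
+     return metrics.get("composite_score", 0.0) >= min_composite
+ ```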
+
+ #### CI/CD Integration
+ - **Quality Checks**: Automated validation in deployment pipelines
+ - **Regression Testing**: Consistency checks across versions
+ - **Performance Benchmarks**: SLA enforcement via automated tests
+
+ ---
+
+ ## Section 9: Security and Compliance
+
+ ### 🔒 **Data Security Framework**
+
+ #### Privacy Protection Strategy
+ ```
+ GDPR-Compliant Architecture:
+ - No persistent storage of user documents
+ - API keys stored locally, never transmitted
+ - Automatic cleanup of temporary processing files
+ - Audit trails without sensitive data exposure
+ ```
+
+ #### Security Best Practices
+ 1. **API Key Management**:
+    - Rotate keys regularly
+    - Never store keys in code
+    - Use environment variables
+
+ 2. **Document Handling**:
+    - Validate file signatures
+    - Enforce size and format restrictions
+    - Automatically sanitize suspicious content
+
+ 3. **Network Security**:
+    - HTTPS-only communications
+    - Certificate pinning where applicable
+    - Rate limiting and DDoS protection
+
+ ### 📋 **Compliance Considerations**
+
+ #### Audit Trail Management
+ - **Processing Logs**: Comprehensive logging without sensitive data
+ - **Quality Metrics**: Historical tracking for compliance reporting
+ - **System Health**: Operational metrics for SLA validation
+ - **User Actions**: Anonymized usage analytics
+
+ ---
+
+ ## Section 10: Future Development and Roadmap
+
+ ### 🔮 **Strategic Directions**
+
+ #### Short-Term Improvements (3-6 months)
+ - **Enhanced Batch Processing**: More efficient multi-document processing
+ - **Advanced Comparison Tools**: Side-by-side analysis capabilities
+ - **Custom Template Support**: User-defined output formatting
+ - **Performance Dashboards**: Real-time operational metrics
+
+ #### Long-Term Vision (6-18 months)
+ ```
+ Architectural Evolution Path:
+ - Multi-LLM Support: Claude, OpenAI, local models
+ - Plugin Ecosystem: Third-party extensions framework
+ - Advanced Analytics: ML-powered quality prediction
+ - Enterprise SSO: Active Directory, OAuth integration
+ ```
+
+ #### Community and Ecosystem
+ - **Open Source Contributions**: Community-driven improvements
+ - **Integration Partners**: Partnerships with document management vendors
+ - **Training Programs**: Certification for enterprise users
+ - **Support Tiers**: SLA-backed support for enterprise deployments
+
+ ---
+
+ ## Appendix A: Technical Specifications
+
+ ### 📋 **System Requirements**
+
+ #### Browser Compatibility
+ | Browser | Minimum Version | Recommended |
+ |---------|----------------|-------------|
+ | Chrome | 90+ | Latest |
+ | Firefox | 88+ | Latest |
+ | Safari | 14+ | Latest |
+ | Edge | 90+ | Latest |
+
+ #### File Format Support Matrix
+ | Format | Max Size | Special Notes |
+ |--------|----------|---------------|
+ | PDF | 50MB | Text-based preferred, OCR available |
+ | DOCX | 50MB | Full formatting preservation |
+ | PPTX | 50MB | Slide structure maintained |
+ | XLSX | 50MB | Table structure optimized |
+ | HTML | 20MB | CSS styling preserved |
+ | TXT | 10MB | Encoding auto-detection |
+
+ ### 🔧 **Advanced Configuration Options**
+
+ #### Environment Variables (for Local Deployment)
+ ```bash
+ # Core Configuration
+ MAX_FILE_SIZE_MB=50
+ PROCESSING_TIMEOUT_SECONDS=300
+ ENABLE_DEBUG_LOGGING=false
+
+ # AI Integration
+ GEMINI_DEFAULT_MODEL=gemini-1.5-pro
+ AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=your-endpoint
+
+ # Performance Tuning
+ CACHE_TTL_HOURS=24
+ MAX_CONCURRENT_PROCESSES=3
+ MEMORY_LIMIT_GB=12
+ ```
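+
+ A sketch of how a local deployment might consume these variables from Python; the names follow the block above, while the defaults are our assumptions:
+
+ ```python
+ import os
+
+ MAX_FILE_SIZE_MB = int(os.environ.get("MAX_FILE_SIZE_MB", "50"))
+ PROCESSING_TIMEOUT = int(os.environ.get("PROCESSING_TIMEOUT_SECONDS", "300"))
+ DEBUG_LOGGING = os.environ.get("ENABLE_DEBUG_LOGGING", "false").lower() == "true"
+ GEMINI_MODEL = os.environ.get("GEMINI_DEFAULT_MODEL", "gemini-1.5-pro")
+ ```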
+
+ ---
+
+ ## Appendix B: Frequently Asked Questions (FAQ)
+
+ ### ❓ **General Questions**
+
+ **Q: Is a Gemini API key required?**
+ A: No, basic document conversion works without an API key. Gemini is only needed for AI-powered analysis and recommendations.
+
+ **Q: What are the file size limits?**
+ A: The HF Spaces free tier limits files to 50MB. For larger files, use a local deployment or split the document into parts.
+
+ **Q: Are my documents stored on the server?**
+ A: No, all documents are processed in memory and deleted automatically on completion. The platform is designed for privacy-first processing.
+
+ **Q: How do I interpret the quality scores?**
+ A: Scores run 0-10: 8+ is excellent, 6-8 good, 4-6 acceptable, and <4 needs attention. Focus on the lowest-scoring components for improvement.
+
+ ### 🔧 **Technical Questions**
+
+ **Q: Can the platform integrate with existing systems?**
+ A: Yes, the platform's modular architecture allows integration through the API or by using the components directly.
+
+ **Q: Which export formats are available?**
+ A: Markdown, HTML, JSON, PDF reports, and ZIP packages with all artifacts.
+
+ **Q: Is batch processing supported?**
+ A: Yes, the Advanced Analytics tab can process several documents at once with comparative analysis.
+
+ ---
+
+ ## Contacts and Support
+
+ ### 📞 **Support Channels**
+
+ **Documentation and Resources:**
+ - [GitHub Repository](https://github.com/your-username/markitdown-testing-platform)
+ - [Technical Documentation](https://docs.your-domain.com)
+ - [Community Forum](https://github.com/your-username/markitdown-testing-platform/discussions)
+
+ **Feedback:**
+ - [Issue Tracker](https://github.com/your-username/markitdown-testing-platform/issues) for bug reports
+ - [Feature Requests](https://github.com/your-username/markitdown-testing-platform/discussions) for new capabilities
+ - Email: [email protected] for enterprise inquiries
+
+ **Community:**
+ - [Discord Channel](https://discord.gg/your-channel) for real-time discussion
+ - [LinkedIn Group](https://linkedin.com/groups/your-group) for professional networking
+ - [YouTube Channel](https://youtube.com/your-channel) for video tutorials
+
+ ---
+
+ **Document version**: 2.0.0 | **Last revised**: September 2025
+
+ *This guide reflects the current state of the platform and will be updated as new features and improvements land.*
README.md ADDED
@@ -0,0 +1,261 @@
+ # 🚀 MarkItDown Testing Platform
+
+ **Enterprise-Grade Document Conversion Testing with AI-Powered Analysis**
+
+ [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/your-username/markitdown-testing-platform)
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ ## 🎯 Overview
+
+ A comprehensive testing platform for Microsoft's MarkItDown document conversion tool, enhanced with Google Gemini AI analysis capabilities. Designed for enterprise-scale document processing workflows with a focus on quality assessment and performance optimization.
+
+ ### ✨ Key Features
+
+ - **🔄 Multi-Format Support**: PDF, DOCX, PPTX, XLSX, HTML, TXT, CSV, JSON, XML
+ - **🤖 AI-Powered Analysis**: Google Gemini integration for quality assessment
+ - **📊 Interactive Dashboards**: Real-time visualization of conversion metrics
+ - **🏢 Enterprise-Ready**: Scalable architecture with comprehensive error handling
+ - **💾 Export Capabilities**: Multiple output formats for integration workflows
+ - **📈 Performance Monitoring**: Detailed analytics and optimization insights
+
+ ## 🚀 Quick Start
+
+ ### Using the Hugging Face Space
+
+ 1. **Visit the Space**: [MarkItDown Testing Platform](https://huggingface.co/spaces/your-username/markitdown-testing-platform)
+ 2. **Upload a Document**: Drag & drop or select your document
+ 3. **Configure Analysis**: Enter a Gemini API key for AI analysis (optional)
+ 4. **Process**: Click "Process Document" and review the results
+ 5. **Export**: Download the results in your preferred format
+
+ ### Getting a Gemini API Key
+
+ 1. Visit [Google AI Studio](https://makersuite.google.com/app/apikey)
+ 2. Create a new API key
+ 3. Copy and paste it into the application
+ 4. Enjoy AI-powered document analysis!
+
+ ## 📋 Supported File Formats
+
+ | Category | Formats | Notes |
+ |----------|---------|-------|
+ | **Documents** | PDF, DOCX, PPTX, XLSX | Full structure preservation |
+ | **Web Content** | HTML, HTM | Complete formatting retention |
+ | **Text Files** | TXT, CSV, JSON, XML | Enhanced parsing capabilities |
+ | **Rich Text** | RTF | Advanced formatting support |
+
+ ## 🏗️ Architecture Overview
+
+ ```
+ ┌─────────────────────────────────────────┐
+ │            Gradio Interface             │
+ ├─────────────────────────────────────────┤
+ │ File Upload │ Config │ Analysis │ Export│
+ ├─────────────────────────────────────────┤
+ │           Processing Pipeline           │
+ ├─────────────────────────────────────────┤
+ │ MarkItDown │ Gemini AI │ Visualization  │
+ ├─────────────────────────────────────────┤
+ │          Analytics & Reporting          │
+ └─────────────────────────────────────────┘
+ ```
+
+ ### Core Components
+
+ - **`core/modules.py`**: Stateless processing engine optimized for HF Spaces
+ - **`llm/gemini_connector.py`**: Enterprise Gemini API integration
+ - **`visualization/analytics_engine.py`**: Interactive dashboard generation
+ - **`app.py`**: Main Gradio application orchestration
+
+ ## 🔧 Technical Specifications
+
+ ### System Requirements
+ - **Python**: 3.10+
+ - **Memory**: Optimized for HF Spaces (16GB limit)
+ - **Storage**: Stateless design with temporary file handling
+ - **Processing**: Async pipeline with resource management
+
+ ### Key Dependencies
+ ```txt
+ gradio>=4.0.0               # UI framework
+ markitdown[all]>=0.1.0      # Document conversion
+ google-generativeai>=0.3.0  # Gemini integration
+ plotly>=5.17.0              # Interactive visualizations
+ pandas>=1.5.0               # Data processing
+ ```
+
+ ## 📊 Analysis Capabilities
+
+ ### Quality Metrics
+ - **Structure Score**: Heading, list, table preservation (0-10)
+ - **Completeness Score**: Information retention assessment (0-10)
+ - **Accuracy Score**: Formatting correctness evaluation (0-10)
+ - **Readability Score**: AI-friendly output optimization (0-10)
+
+ ### AI Analysis Types
+ - **Quality Analysis**: Comprehensive conversion assessment
+ - **Structure Review**: Document hierarchy and organization
+ - **Content Summary**: Thematic analysis and key insights
+ - **Extraction Quality**: Data preservation evaluation
+
+ ### Visualization Features
+ - **Quality Dashboard**: Multi-metric radar and performance charts
+ - **Structure Analysis**: Hierarchical document mapping
+ - **Comparison Tools**: Multi-document analysis capabilities
+ - **Performance Timeline**: Processing optimization insights
+
+ ## 🎯 Use Cases
+
+ ### Enterprise Document Migration
+ - **Legacy System Modernization**: Convert historical documents to modern formats
+ - **Content Management**: Standardize document formats across organizations
+ - **Compliance Documentation**: Ensure consistent formatting for regulatory requirements
+
+ ### AI/ML Pipeline Integration
+ - **RAG System Preparation**: Optimize documents for retrieval systems
+ - **Training Data Processing**: Convert diverse formats for model training
+ - **Content Analysis**: Extract structured data from unstructured documents
+
+ ### Quality Assurance
+ - **Conversion Validation**: Verify the accuracy of automated processing
+ - **Performance Benchmarking**: Compare different conversion approaches
+ - **Error Detection**: Identify and resolve processing issues
+
+ ## 📈 Performance Optimization
+
+ ### HF Spaces Optimizations
+ - **Memory Management**: Automatic cleanup and resource monitoring
+ - **Processing Limits**: Smart file size and timeout management
+ - **Async Processing**: Non-blocking operations for better UX
+ - **Error Recovery**: Graceful degradation and retry mechanisms
+
+ ### Best Practices
+ - **File Preparation**: Use high-quality source documents
+ - **API Management**: Monitor Gemini API usage and limits
+ - **Result Analysis**: Review quality metrics for optimization opportunities
+ - **Export Strategy**: Choose appropriate formats for downstream processing
+
+ ## 🛠️ Development Setup
+
+ ### Local Development
+ ```bash
+ # Clone the repository
+ git clone https://github.com/your-username/markitdown-testing-platform
+ cd markitdown-testing-platform
+
+ # Create a virtual environment
+ python -m venv venv
+ source venv/bin/activate  # Linux/Mac
+ # venv\Scripts\activate   # Windows
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the application
+ python app.py
+ ```
+
+ ### Environment Variables
+ ```bash
+ # Optional: set custom configurations
+ export GRADIO_TEMP_DIR="/tmp"
+ export MAX_FILE_SIZE="52428800"  # 50MB in bytes
+ export PROCESSING_TIMEOUT="300"  # 5 minutes
+ ```
+
+ ## 📚 API Reference
+
+ ### Core Processing Pipeline
+ ```python
+ # The await calls below must run inside an async context
+ from core.modules import StreamlineFileHandler, HFConversionEngine
+ from llm.gemini_connector import GeminiAnalysisEngine
+
+ # Initialize components
+ handler = StreamlineFileHandler(resource_manager)
+ engine = HFConversionEngine(resource_manager, config)
+ gemini = GeminiAnalysisEngine(gemini_config)
+
+ # Process a document
+ file_result = await handler.process_upload(file_obj)
+ conversion_result = await engine.convert_stream(file_content, metadata)
+ analysis_result = await gemini.analyze_content(analysis_request)
+ ```
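+
+ Since the snippet uses `await`, it has to be driven from a coroutine; a minimal driver might look like the sketch below (the component objects and inputs are the placeholders from the snippet above, not defined here):
+
+ ```python
+ import asyncio
+
+ async def main():
+     file_result = await handler.process_upload(file_obj)
+     conversion_result = await engine.convert_stream(file_content, metadata)
+     print(conversion_result)
+
+ asyncio.run(main())
+ ```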
184
+
185
+ ### Visualization Generation
186
+ ```python
187
+ from visualization.analytics_engine import InteractiveVisualizationEngine
188
+
189
+ viz_engine = InteractiveVisualizationEngine()
190
+ dashboard = viz_engine.create_quality_dashboard(conversion_result, analysis_result)
191
+ structure_viz = viz_engine.create_structural_analysis_viz(conversion_result)
192
+ ```
193
+
194
+ ## 🔐 Security & Privacy
195
+
196
+ ### Data Handling
197
+ - **No Persistent Storage**: All processing in memory with automatic cleanup
198
+ - **API Key Security**: Keys stored locally, never transmitted to servers
199
+ - **File Privacy**: Temporary files automatically deleted after processing
200
+ - **Error Logging**: Sanitized logs without sensitive information
201
+
202
+ ### Compliance Features
203
+ - **GDPR Ready**: No personal data retention
204
+ - **Enterprise Security**: Secure API integrations
205
+ - **Audit Trail**: Comprehensive processing logs
206
+ - **Access Control**: Environment-based configuration
207
+
208
+ ## 🤝 Contributing
209
+
210
+ ### Development Guidelines
211
+ 1. **Code Style**: Follow PEP 8 with Black formatting
212
+ 2. **Testing**: Comprehensive unit and integration tests
213
+ 3. **Documentation**: Detailed docstrings and README updates
214
+ 4. **Performance**: Memory-efficient and HF Spaces optimized
215
+
216
+ ### Pull Request Process
217
+ 1. Fork the repository
218
+ 2. Create feature branch (`git checkout -b feature/amazing-feature`)
219
+ 3. Commit changes (`git commit -m 'Add amazing feature'`)
220
+ 4. Push to branch (`git push origin feature/amazing-feature`)
221
+ 5. Open Pull Request
222
+
223
+ ## 📄 License
224
+
225
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
226
+
227
+ ## 🙏 Acknowledgments
228
+
229
+ - **Microsoft MarkItDown**: Core document conversion capabilities
230
+ - **Google Gemini**: Advanced AI analysis features
231
+ - **Hugging Face**: Platform hosting and community support
232
+ - **Plotly**: Interactive visualization framework
233
+ - **Gradio**: User interface framework
234
+
235
+ ## 📞 Support
236
+
237
+ ### Getting Help
238
+ - **Documentation**: Comprehensive guides and examples
239
+ - **Issues**: [GitHub Issues](https://github.com/your-username/markitdown-testing-platform/issues)
240
+ - **Discussions**: [Community Forum](https://github.com/your-username/markitdown-testing-platform/discussions)
241
+ - **Email**: [email protected]
242
+
243
+ ### Frequently Asked Questions
244
+
245
+ **Q: What's the maximum file size?**
246
+ A: 50MB for HF Spaces free tier. Larger files can be processed in local deployments.
247
+
248
+ **Q: Do I need a Gemini API key?**
249
+ A: No, basic conversion works without API key. Gemini key enables AI analysis features.
250
+
251
+ **Q: Can I process multiple files at once?**
252
+ A: Current version supports single-file processing. Batch processing available in advanced analytics.
253
+
254
+ **Q: How accurate are the quality scores?**
255
+ A: Scores are based on structural analysis and AI evaluation. Treat them as guidelines for optimization rather than absolute measurements.
256
+
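+ Programmatically, the same score is exposed on the processing response as `response.quality_metrics.get('composite_score', 0)` (see the `ProcessingResponse` handling in `app.py`).
+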
257
+ ---
258
+
259
+ **Built with ❤️ for enterprise document processing**
260
+
261
+ *Last updated: September 2025*
app.py ADDED
@@ -0,0 +1,1244 @@
1
+ """
2
+ MarkItDown Testing Platform - Enterprise Architecture Implementation
3
+
4
+ Strategic Design Philosophy:
5
+ "Complexity is the enemy of reliable software"
6
+
7
+ Core Architectural Principles:
8
+ - Minimize cognitive load for developers
9
+ - Create self-documenting, modular interfaces
10
+ - Design for future adaptability
11
+ - Prioritize human understanding over technical complexity
12
+
13
+ This implementation demonstrates enterprise-grade architectural patterns
14
+ optimized for long-term maintainability and team collaboration.
15
+ """
16
+
17
+ import os
18
+ import asyncio
19
+ import json
20
+ import logging
21
+ from datetime import datetime
22
+ from typing import Dict, Optional, List, Tuple, Protocol, Any
23
+ from dataclasses import dataclass, field
24
+ from abc import ABC, abstractmethod
25
+ import gradio as gr
26
+ from pathlib import Path
27
+ from pydantic import JsonValue
28
+
29
+ # Strategic import organization - dependency layers clearly defined
30
+ from core.modules import (
31
+ StreamlineFileHandler, HFConversionEngine, ResourceManager,
32
+ ProcessingConfig, ProcessingResult
33
+ )
34
+ from llm.gemini_connector import (
35
+ GeminiAnalysisEngine, GeminiConnectionManager, GeminiConfig,
36
+ AnalysisRequest, AnalysisType, GeminiModel
37
+ )
38
+ from visualization.analytics_engine import (
39
+ InteractiveVisualizationEngine, QualityMetricsCalculator,
40
+ VisualizationConfig, ReportGenerator
41
+ )
42
+
43
+ # Configure enterprise-grade logging
44
+ logging.basicConfig(
45
+ level=logging.INFO,
46
+ format='%(asctime)s - %(name)s - %(levelname)s - [%(funcName)s:%(lineno)d] - %(message)s'
47
+ )
48
+ logger = logging.getLogger(__name__)
49
+
50
+
51
+ # ==================== SERIALIZABLE TYPE DEFINITIONS ====================
52
+
53
+ JSONDict = Dict[str, JsonValue]
54
+
55
+
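+ # Monkey-patch (assumption about intent: avoids Gradio API-schema generation
+ # errors with custom State payloads) by reporting an empty endpoint schema.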
56
+ if hasattr(gr.Blocks, "get_api_info"):
57
+ def _suppress_api_info(self):
58
+ return {"named_endpoints": {}, "unnamed_endpoints": []}
59
+
60
+ gr.Blocks.get_api_info = _suppress_api_info
61
+
62
+
63
+ # ==================== STRATEGIC DATA MODELS ====================
64
+
65
+ @dataclass(frozen=True)
66
+ class ProcessingRequest:
67
+ """Immutable request container - eliminates parameter coupling"""
68
+
69
+ file_content: bytes
70
+ file_metadata: JSONDict
71
+ gemini_api_key: Optional[str] = None
72
+ analysis_type: str = "quality_analysis"
73
+ model_preference: str = "gemini-1.5-pro"
74
+ enable_plugins: bool = False
75
+ azure_endpoint: Optional[str] = None
76
+ session_context: JSONDict = field(default_factory=dict)
77
+
78
+
79
+ @dataclass(frozen=True)
80
+ class ProcessingResponse:
81
+ """Standardized response container - predictable interface"""
82
+
83
+ success: bool
84
+ conversion_result: Optional[ProcessingResult]
85
+ analysis_result: Optional[Any]
86
+ quality_metrics: JSONDict
87
+ error_details: Optional[str]
88
+ processing_metadata: JSONDict
89
+
90
+ @classmethod
91
+ def success_response(
92
+ cls,
93
+ conversion_result: ProcessingResult,
94
+ analysis_result: Any = None,
95
+ quality_metrics: Optional[JSONDict] = None
96
+ ) -> 'ProcessingResponse':
97
+ """Factory method for successful processing"""
98
+ return cls(
99
+ success=True,
100
+ conversion_result=conversion_result,
101
+ analysis_result=analysis_result,
102
+ quality_metrics=quality_metrics or {},
103
+ error_details=None,
104
+ processing_metadata={'completed_at': datetime.now().isoformat()}
105
+ )
106
+
107
+ @classmethod
108
+ def error_response(cls, error_message: str, error_context: Optional[JSONDict] = None) -> 'ProcessingResponse':
109
+ """Factory method for error scenarios"""
110
+ return cls(
111
+ success=False,
112
+ conversion_result=None,
113
+ analysis_result=None,
114
+ quality_metrics={},
115
+ error_details=error_message,
116
+ processing_metadata=error_context or {'failed_at': datetime.now().isoformat()}
117
+ )
118
+
119
+
120
+ @dataclass
121
+ class ApplicationState:
122
+ """Centralized state management - eliminates state scatter"""
123
+
124
+ session_id: str
125
+ processing_history: List[ProcessingResponse] = field(default_factory=list)
126
+ current_gemini_engine_id: Optional[str] = None
127
+ user_preferences: JSONDict = field(default_factory=dict)
128
+ system_metrics: JSONDict = field(default_factory=dict)
129
+
130
+ def add_processing_result(self, response: ProcessingResponse) -> 'ApplicationState':
131
+ """Immutable state update pattern"""
132
+ new_history = self.processing_history + [response]
133
+ return ApplicationState(
134
+ session_id=self.session_id,
135
+ processing_history=new_history,
136
+ current_gemini_engine_id=self.current_gemini_engine_id,
137
+ user_preferences=self.user_preferences,
138
+ system_metrics=self.system_metrics
139
+ )
140
+
141
+
142
+ # ==================== STRATEGIC ABSTRACTION LAYER ====================
143
+
144
+ class ProcessingOrchestrator(Protocol):
145
+ """Interface abstraction - enables component replacement"""
146
+
147
+ async def process_document(self, request: ProcessingRequest) -> ProcessingResponse:
148
+ """Core processing contract"""
149
+ ...
150
+
151
+ def get_processing_status(self) -> JSONDict:
152
+ """System health interface"""
153
+ ...
154
+
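+ # Note: DocumentProcessingOrchestrator below satisfies this Protocol purely
+ # structurally (PEP 544); no inheritance is required.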
155
+
156
+ class UIResponseFactory(Protocol):
157
+ """UI generation abstraction - separates presentation from logic"""
158
+
159
+ def create_success_response(self, response: ProcessingResponse) -> Tuple[str, str, str, JSONDict]:
160
+ """Generate UI components for successful processing"""
161
+ ...
162
+
163
+ def create_error_response(self, error_message: str) -> Tuple[str, str, str, JSONDict]:
164
+ """Generate UI components for error scenarios"""
165
+ ...
166
+
167
+
168
+ # ==================== CORE ORCHESTRATION IMPLEMENTATION ====================
169
+
170
+ class DocumentProcessingOrchestrator:
171
+ """
172
+ Strategic orchestration layer - coordinates component interactions
173
+
174
+ Design Principles:
175
+ - Single Responsibility: Document processing coordination only
176
+ - Dependency Injection: All components provided at construction
177
+ - Error Boundary: Comprehensive error handling and recovery
178
+ - Observable: Rich logging and metrics for operational visibility
179
+ """
180
+
181
+ def __init__(
182
+ self,
183
+ file_handler: StreamlineFileHandler,
184
+ conversion_engine: HFConversionEngine,
185
+ gemini_manager: GeminiConnectionManager,
186
+ viz_engine: InteractiveVisualizationEngine,
187
+ quality_calculator: QualityMetricsCalculator
188
+ ):
189
+ self.file_handler = file_handler
190
+ self.conversion_engine = conversion_engine
191
+ self.gemini_manager = gemini_manager
192
+ self.viz_engine = viz_engine
193
+ self.quality_calculator = quality_calculator
194
+
195
+ # Operational metrics
196
+ self.processing_count = 0
197
+ self.error_count = 0
198
+ self.total_processing_time = 0.0
199
+
200
+ async def process_document(self, request: ProcessingRequest) -> ProcessingResponse:
201
+ """
202
+ Primary processing coordination with comprehensive error handling
203
+
204
+ Strategic Approach:
205
+ 1. Input validation and sanitization
206
+ 2. Resource availability verification
207
+ 3. Processing pipeline execution with checkpoints
208
+ 4. Quality assessment and metrics generation
209
+ 5. Response standardization and logging
210
+ """
211
+
212
+ processing_start = datetime.now()
213
+ self.processing_count += 1
214
+
215
+ try:
216
+ logger.info(f"Starting document processing - Session: {request.session_context.get('session_id', 'unknown')}")
217
+
218
+ # Phase 1: Document Ingestion and Validation
219
+ conversion_result = await self._execute_conversion_pipeline(request)
220
+ if not conversion_result.success:
221
+ return ProcessingResponse.error_response(
222
+ f"Conversion failed: {conversion_result.error_message}",
223
+ {"phase": "conversion", "request_metadata": request.file_metadata}
224
+ )
225
+
226
+ # Phase 2: AI Analysis (Optional Enhancement)
227
+ analysis_result = None
228
+ if request.gemini_api_key:
229
+ analysis_result = await self._execute_analysis_pipeline(
230
+ request, conversion_result
231
+ )
232
+ # Note: Analysis failure is non-fatal - system continues with conversion results
233
+
234
+ # Phase 3: Quality Assessment and Metrics Generation
235
+ quality_metrics = self.quality_calculator.calculate_conversion_quality_metrics(
236
+ conversion_result, analysis_result
237
+ )
238
+
239
+ # Phase 4: Response Assembly and Logging
240
+ processing_duration = (datetime.now() - processing_start).total_seconds()
241
+ self.total_processing_time += processing_duration
242
+
243
+ logger.info(f"Processing completed successfully in {processing_duration:.2f}s")
244
+
245
+ return ProcessingResponse.success_response(
246
+ conversion_result=conversion_result,
247
+ analysis_result=analysis_result,
248
+ quality_metrics=quality_metrics
249
+ )
250
+
251
+ except Exception as e:
252
+ self.error_count += 1
253
+ error_duration = (datetime.now() - processing_start).total_seconds()
254
+
255
+ logger.error(f"Processing failed after {error_duration:.2f}s: {str(e)}")
256
+
257
+ return ProcessingResponse.error_response(
258
+ error_message=f"System processing error: {str(e)}",
259
+ error_context={
260
+ "processing_duration": error_duration,
261
+ "error_type": type(e).__name__,
262
+ "processing_phase": "unknown"
263
+ }
264
+ )
265
+
266
+ async def _execute_conversion_pipeline(self, request: ProcessingRequest) -> ProcessingResult:
267
+ """Isolated conversion processing with resource management"""
268
+
269
+ # Create mock file object for processing
270
+ class ProcessingFile:
271
+ def __init__(self, content: bytes, metadata: JSONDict):
272
+ self.content = content
273
+ self.name = metadata.get('filename', 'uploaded_file')
274
+ self.size = len(content)
275
+
276
+ def read(self) -> bytes:
277
+ return self.content
278
+
279
+ processing_file = ProcessingFile(request.file_content, request.file_metadata)
280
+
281
+ # Execute file processing
282
+ file_result = await self.file_handler.process_upload(
283
+ processing_file,
284
+ metadata_override=request.file_metadata
285
+ )
286
+ if not file_result.success:
287
+ return file_result
288
+
289
+ # Execute document conversion
290
+ conversion_result = await self.conversion_engine.convert_stream(
291
+ request.file_content, request.file_metadata
292
+ )
293
+
294
+ return conversion_result
295
+
296
+ async def _execute_analysis_pipeline(
297
+ self,
298
+ request: ProcessingRequest,
299
+ conversion_result: ProcessingResult
300
+ ) -> Optional[Any]:
301
+ """Isolated AI analysis processing with graceful degradation"""
302
+
303
+ try:
304
+ # Initialize or retrieve Gemini engine
305
+ gemini_config = GeminiConfig(api_key=request.gemini_api_key)
306
+ engine_id = await self.gemini_manager.create_engine(
307
+ request.gemini_api_key, gemini_config
308
+ )
309
+
310
+ engine = self.gemini_manager.get_engine(engine_id)
311
+ if not engine:
312
+ logger.warning("Gemini engine creation failed - proceeding without analysis")
313
+ return None
314
+
315
+ # Execute analysis
316
+ analysis_request = AnalysisRequest(
317
+ content=conversion_result.content,
318
+ analysis_type=AnalysisType(request.analysis_type),
319
+ model=GeminiModel(request.model_preference)
320
+ )
321
+
322
+ analysis_result = await engine.analyze_content(analysis_request)
323
+
324
+ if analysis_result.success:
325
+ logger.info(f"AI analysis completed - Type: {request.analysis_type}")
326
+ return analysis_result
327
+ else:
328
+ logger.warning(f"AI analysis failed: {analysis_result.error_message}")
329
+ return None
330
+
331
+ except Exception as e:
332
+ logger.warning(f"AI analysis pipeline error (non-fatal): {str(e)}")
333
+ return None
334
+
335
+ def get_processing_status(self) -> JSONDict:
336
+ """Operational visibility interface"""
337
+
338
+ success_rate = (
339
+ ((self.processing_count - self.error_count) / self.processing_count * 100)
340
+ if self.processing_count > 0 else 0
341
+ )
342
+
343
+ average_processing_time = (
344
+ self.total_processing_time / self.processing_count
345
+ if self.processing_count > 0 else 0
346
+ )
347
+
348
+ return {
349
+ 'total_documents_processed': self.processing_count,
350
+ 'success_rate_percent': success_rate,
351
+ 'error_count': self.error_count,
352
+ 'average_processing_time_seconds': average_processing_time,
353
+ 'total_processing_time_seconds': self.total_processing_time,
354
+ 'status': 'healthy' if success_rate > 90 else 'degraded' if success_rate > 70 else 'unhealthy'
355
+ }
356
+
357
+
358
+ # ==================== UI PRESENTATION LAYER ====================
359
+
360
+ class GradioResponseFactory:
361
+ """
362
+ Strategic UI generation - separates presentation logic from business logic
363
+
364
+ Design Principles:
365
+ - Presentation Separation: UI generation isolated from business logic
366
+ - Consistent Interface: Standardized response patterns
367
+ - Error Communication: Clear, actionable user messaging
368
+ - Progressive Enhancement: Graceful degradation for failed components
369
+ """
370
+
371
+ def __init__(self, viz_engine: InteractiveVisualizationEngine):
372
+ self.viz_engine = viz_engine
373
+
374
+ def create_success_response(
375
+ self,
376
+ response: ProcessingResponse
377
+ ) -> Tuple[str, str, str, JSONDict]:
378
+ """Generate comprehensive success UI components"""
379
+
380
+ # Status display with professional formatting
381
+ processing_time = response.conversion_result.processing_time or 0
382
+ content_length = len(response.conversion_result.content)
383
+
384
+ status_html = f"""
385
+ <div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 10px; margin: 10px 0;">
386
+ <h3 style="margin: 0 0 10px 0;">✅ Processing Completed Successfully</h3>
387
+ <div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 15px; margin-top: 15px;">
388
+ <div>
389
+ <strong>Processing Time:</strong><br/>
390
+ <span style="font-size: 1.2em;">{processing_time:.2f} seconds</span>
391
+ </div>
392
+ <div>
393
+ <strong>Content Generated:</strong><br/>
394
+ <span style="font-size: 1.2em;">{content_length:,} characters</span>
395
+ </div>
396
+ <div>
397
+ <strong>Quality Score:</strong><br/>
398
+ <span style="font-size: 1.2em;">{response.quality_metrics.get('composite_score', 0):.1f}/10</span>
399
+ </div>
400
+ </div>
401
+ </div>
402
+ """
403
+
404
+ # Document preview with metadata
405
+ original_preview = self._generate_document_preview(response.conversion_result.metadata)
406
+
407
+ # Markdown output
408
+ markdown_content = response.conversion_result.content
409
+
410
+ # Metrics summary for quick review
411
+ quick_metrics = self._extract_summary_metrics(response)
412
+
413
+ return (
414
+ status_html,
415
+ original_preview,
416
+ markdown_content,
417
+ quick_metrics
418
+ )
419
+
420
+ def create_error_response(
421
+ self,
422
+ error_message: str,
423
+ error_context: Optional[JSONDict] = None
424
+ ) -> Tuple[str, str, str, JSONDict]:
425
+ """Generate comprehensive error UI components with actionable guidance"""
426
+
427
+ # Determine error severity and user guidance
428
+ error_type = error_context.get('error_type', 'Unknown') if error_context else 'Unknown'
429
+ processing_phase = error_context.get('processing_phase', 'unknown') if error_context else 'unknown'
430
+
431
+ # Generate user-friendly error messaging
432
+ if 'Gemini' in error_message or 'API' in error_message:
433
+ user_guidance = "This appears to be an AI analysis issue. The document conversion may have succeeded. Check your API key and try again."
434
+ elif 'conversion' in error_message.lower():
435
+ user_guidance = "Document conversion failed. Please verify your file format is supported and try again."
436
+ elif 'resource' in error_message.lower():
437
+ user_guidance = "System resources are currently limited. Try with a smaller file or wait a moment before retrying."
438
+ else:
439
+ user_guidance = "An unexpected error occurred. Please try again or contact support if the problem persists."
440
+
441
+ error_html = f"""
442
+ <div style="background: linear-gradient(135deg, #ff6b6b 0%, #ee5a52 100%); color: white; padding: 20px; border-radius: 10px; margin: 10px 0;">
443
+ <h3 style="margin: 0 0 10px 0;">❌ Processing Failed</h3>
444
+ <div style="background: rgba(255,255,255,0.1); padding: 15px; border-radius: 8px; margin: 10px 0;">
445
+ <strong>Error Details:</strong><br/>
446
+ {error_message}
447
+ </div>
448
+ <div style="background: rgba(255,255,255,0.1); padding: 15px; border-radius: 8px; margin: 10px 0;">
449
+ <strong>💡 Recommended Action:</strong><br/>
450
+ {user_guidance}
451
+ </div>
452
+ {f'<p><strong>Error Type:</strong> {error_type} | <strong>Phase:</strong> {processing_phase}</p>' if error_context else ''}
453
+ </div>
454
+ """
455
+
456
+ return (
457
+ error_html,
458
+ "", # No preview for errors
459
+ "", # No markdown content for errors
460
+ {"error": error_message, "timestamp": datetime.now().isoformat()}
461
+ )
462
+
463
+ def _generate_document_preview(self, metadata: JSONDict) -> str:
464
+ """Generate professional document metadata preview"""
465
+
466
+ original_file = metadata.get('original_file', {})
467
+
468
+ return f"""
469
+ <div style="background: #f8f9fa; border: 1px solid #dee2e6; border-radius: 8px; padding: 20px; margin: 10px 0;">
470
+ <h4 style="color: #495057; margin-bottom: 15px;">📄 Document Information</h4>
471
+ <table style="width: 100%; border-collapse: collapse;">
472
+ <tr style="border-bottom: 1px solid #dee2e6;">
473
+ <td style="padding: 8px; font-weight: bold; color: #6c757d;">Filename:</td>
474
+ <td style="padding: 8px;">{original_file.get('filename', 'Unknown')}</td>
475
+ </tr>
476
+ <tr style="border-bottom: 1px solid #dee2e6;">
477
+ <td style="padding: 8px; font-weight: bold; color: #6c757d;">File Size:</td>
478
+ <td style="padding: 8px;">{original_file.get('size', 0) / 1024:.1f} KB</td>
479
+ </tr>
480
+ <tr style="border-bottom: 1px solid #dee2e6;">
481
+ <td style="padding: 8px; font-weight: bold; color: #6c757d;">Format:</td>
482
+ <td style="padding: 8px;">{original_file.get('extension', 'Unknown').upper()}</td>
483
+ </tr>
484
+ <tr style="border-bottom: 1px solid #dee2e6;">
485
+ <td style="padding: 8px; font-weight: bold; color: #6c757d;">Processing Date:</td>
486
+ <td style="padding: 8px;">{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</td>
487
+ </tr>
488
+ </table>
489
+ </div>
490
+ """
491
+
492
+ def _extract_summary_metrics(self, response: ProcessingResponse) -> JSONDict:
493
+ """Extract key metrics for UI display"""
494
+
495
+ basic_metrics = response.quality_metrics.get('basic_metrics', {})
496
+ structural_metrics = response.quality_metrics.get('structural_metrics', {})
497
+
498
+ return {
499
+ 'overall_score': response.quality_metrics.get('composite_score', 0),
500
+ 'processing_time': response.conversion_result.processing_time,
501
+ 'content_statistics': {
502
+ 'total_words': basic_metrics.get('total_words', 0),
503
+ 'total_lines': basic_metrics.get('total_lines', 0),
504
+ 'total_characters': basic_metrics.get('total_characters', 0)
505
+ },
506
+ 'structural_elements': {
507
+ 'headers': structural_metrics.get('header_count', 0),
508
+ 'lists': structural_metrics.get('list_items', 0),
509
+ 'tables': structural_metrics.get('table_rows', 0),
510
+ 'links': structural_metrics.get('links', 0)
511
+ },
512
+             'ai_analysis_available': bool(response.analysis_result and response.analysis_result.success)
513
+ }
514
+
515
+
516
+ # ==================== MAIN APPLICATION ASSEMBLY ====================
517
+
518
+ class MarkItDownTestingApp:
519
+ """
520
+ Strategic application orchestration - human-scale complexity management
521
+
522
+ Core Design Philosophy:
523
+ - Dependency Injection: All components provided at construction
524
+ - Single Responsibility: UI orchestration only
525
+ - Error Boundaries: Comprehensive error handling at interaction level
526
+ - State Management: Immutable state patterns with clear update paths
527
+
528
+ This class represents the composition root of the application - where all
529
+ dependencies are wired together and the system boundary is established.
530
+ """
531
+
532
+ def __init__(
533
+ self,
534
+ orchestrator: DocumentProcessingOrchestrator,
535
+ ui_factory: GradioResponseFactory,
536
+ initial_state: Optional[ApplicationState] = None
537
+ ):
538
+ self.orchestrator = orchestrator
539
+ self.ui_factory = ui_factory
540
+ self.app_state = initial_state or ApplicationState(
541
+ session_id=datetime.now().isoformat()
542
+ )
543
+
544
+ # Application configuration
545
+ self.config = {
546
+ 'title': 'MarkItDown Testing Platform',
547
+ 'version': '2.0.0-enterprise',
548
+ 'max_file_size_mb': 50,
549
+ 'supported_formats': ['.pdf', '.docx', '.pptx', '.xlsx', '.txt', '.html', '.htm', '.csv', '.json', '.xml']
550
+ }
551
+
552
+ def create_interface(self) -> gr.Blocks:
553
+ """
554
+ Gradio interface assembly with modular component design
555
+
556
+ Strategic Approach:
557
+ - Component Isolation: Each UI section is self-contained
558
+ - Event Handling: Clean separation between UI events and business logic
559
+ - State Management: Immutable state updates with clear data flow
560
+ - Error Handling: User-friendly error presentation with recovery guidance
561
+ """
562
+
563
+ with gr.Blocks(
564
+ title=self.config['title'],
565
+ theme=gr.themes.Soft(),
566
+ analytics_enabled=False
567
+ ) as interface:
568
+
569
+ # Application state for Gradio
570
+ gr_state = gr.State(self.app_state)
571
+
572
+ # Main header
573
+ self._create_application_header()
574
+
575
+ # Primary interface tabs
576
+ with gr.Tabs():
577
+
578
+ # Document Processing Tab
579
+ with gr.TabItem("📁 Document Processing"):
580
+ processing_components = self._create_processing_interface(gr_state)
581
+
582
+ # Analytics Dashboard Tab
583
+ with gr.TabItem("📊 Analysis Dashboard"):
584
+ analytics_components = self._create_analytics_interface(gr_state)
585
+
586
+ # System Status Tab
587
+ with gr.TabItem("⚙️ System Status"):
588
+ self._create_status_interface()
589
+
590
+ # Wire event handlers with clean separation
591
+ self._wire_event_handlers(processing_components, analytics_components, gr_state)
592
+
593
+ # Application footer
594
+ self._create_application_footer()
595
+
596
+ return interface
597
+
598
+ def _create_application_header(self) -> None:
599
+ """Professional application header with branding"""
600
+
601
+ gr.HTML(f"""
602
+ <div style="text-align: center; background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
603
+ color: white; padding: 2rem; border-radius: 10px; margin-bottom: 2rem;">
604
+ <h1 style="margin: 0; font-size: 2.5em;">🚀 {self.config['title']}</h1>
605
+ <p style="margin: 10px 0; font-size: 1.2em;">Enterprise-Grade Document Conversion Testing with AI-Powered Analysis</p>
606
+ <p style="margin: 0; opacity: 0.9;">
607
+ <em>Version {self.config['version']} | Powered by Microsoft MarkItDown & Google Gemini</em>
608
+ </p>
609
+ </div>
610
+ """)
611
+
612
+ def _create_processing_interface(self, gr_state: gr.State) -> Dict[str, Any]:
613
+ """Document processing interface with professional UX"""
614
+
615
+ with gr.Row():
616
+ with gr.Column(scale=1):
617
+ gr.Markdown("### 📤 Document Upload & Configuration")
618
+
619
+ # File upload
620
+ file_upload = gr.File(
621
+ label="Select Document",
622
+ file_types=self.config['supported_formats'],
623
+ type="binary"
624
+ )
625
+
626
+ # Processing configuration
627
+ with gr.Accordion("🔧 Processing Configuration", open=True):
628
+ gemini_api_key = gr.Textbox(
629
+ label="Gemini API Key (Optional)",
630
+ type="password",
631
+ placeholder="Enter your Google Gemini API key for AI analysis...",
632
+ info="Leave empty for basic conversion only"
633
+ )
634
+
635
+ analysis_type = gr.Dropdown(
636
+ choices=[
637
+ ("Quality Analysis", "quality_analysis"),
638
+ ("Structure Review", "structure_review"),
639
+ ("Content Summary", "content_summary"),
640
+ ("Extraction Quality", "extraction_quality")
641
+ ],
642
+ value="quality_analysis",
643
+ label="Analysis Type"
644
+ )
645
+
646
+ model_preference = gr.Dropdown(
647
+ choices=[
648
+ ("Gemini 1.5 Pro (Best Quality)", "gemini-1.5-pro"),
649
+ ("Gemini 1.5 Flash (Faster)", "gemini-1.5-flash")
650
+ ],
651
+ value="gemini-1.5-pro",
652
+ label="AI Model Preference"
653
+ )
654
+
655
+ # Action buttons
656
+ with gr.Row():
657
+ process_btn = gr.Button(
658
+ "🚀 Process Document",
659
+ variant="primary",
660
+ size="lg"
661
+ )
662
+ clear_btn = gr.Button(
663
+ "🔄 Clear Session",
664
+ variant="secondary"
665
+ )
666
+
667
+ with gr.Column(scale=2):
668
+ # Results display area
669
+ gr.Markdown("### 📊 Processing Results")
670
+
671
+ status_display = gr.HTML()
672
+
673
+ with gr.Tabs():
674
+ with gr.TabItem("📄 Original Document"):
675
+ original_preview = gr.HTML()
676
+
677
+ with gr.TabItem("📝 Markdown Output"):
678
+ markdown_output = gr.Code(
679
+ language="markdown",
680
+ show_label=False,
681
+ interactive=False
682
+ )
683
+
684
+ with gr.TabItem("📈 Quick Metrics"):
685
+ quick_metrics = gr.JSON()
686
+
687
+ return {
688
+ 'file_upload': file_upload,
689
+ 'gemini_api_key': gemini_api_key,
690
+ 'analysis_type': analysis_type,
691
+ 'model_preference': model_preference,
692
+ 'process_btn': process_btn,
693
+ 'clear_btn': clear_btn,
694
+ 'status_display': status_display,
695
+ 'original_preview': original_preview,
696
+ 'markdown_output': markdown_output,
697
+ 'quick_metrics': quick_metrics
698
+ }
699
+
700
+ def _create_analytics_interface(self, gr_state: gr.State) -> Dict[str, Any]:
701
+ """Analytics dashboard interface"""
702
+
703
+ gr.Markdown("### 📊 Document Analysis Dashboard")
704
+
705
+ with gr.Row():
706
+ refresh_btn = gr.Button("🔄 Refresh Dashboard", variant="secondary")
707
+
708
+ with gr.Row():
709
+ quality_dashboard = gr.Plot(label="Quality Analysis Dashboard")
710
+
711
+ with gr.Row():
712
+ with gr.Column():
713
+ analysis_summary = gr.Markdown("*Process a document to see analysis results*")
714
+ with gr.Column():
715
+ structure_metrics = gr.JSON(label="Structure Analysis")
716
+
717
+ return {
718
+ 'refresh_btn': refresh_btn,
719
+ 'quality_dashboard': quality_dashboard,
720
+ 'analysis_summary': analysis_summary,
721
+ 'structure_metrics': structure_metrics
722
+ }
723
+
724
+ def _create_status_interface(self) -> None:
725
+ """System status and health monitoring interface"""
726
+
727
+ gr.Markdown("### ⚙️ System Status & Health")
728
+
729
+ with gr.Row():
730
+ with gr.Column():
731
+ system_health = gr.JSON(
732
+ label="System Health Metrics",
733
+ value=self._get_system_status()
734
+ )
735
+
736
+ with gr.Column():
737
+ processing_stats = gr.JSON(
738
+ label="Processing Statistics",
739
+ value=self.orchestrator.get_processing_status()
740
+ )
741
+
742
+ def _create_application_footer(self) -> None:
743
+ """Professional application footer"""
744
+
745
+ gr.HTML("""
746
+ <div style="text-align: center; padding: 1rem; color: #6c757d; border-top: 1px solid #dee2e6; margin-top: 2rem;">
747
+ <p>Built with enterprise-grade architecture principles |
748
+ <a href="https://github.com/microsoft/markitdown">Microsoft MarkItDown</a> |
749
+ <a href="https://ai.google.dev/">Google Gemini</a></p>
750
+ </div>
751
+ """)
752
+
753
+ def _wire_event_handlers(
754
+ self,
755
+ processing_components: Dict[str, Any],
756
+ analytics_components: Dict[str, Any],
757
+ gr_state: gr.State
758
+ ) -> None:
759
+ """Wire event handlers with clean separation of concerns"""
760
+
761
+ # Document processing handler
762
+ processing_components['process_btn'].click(
763
+ fn=self._handle_document_processing,
764
+ inputs=[
765
+ processing_components['file_upload'],
766
+ processing_components['gemini_api_key'],
767
+ processing_components['analysis_type'],
768
+ processing_components['model_preference'],
769
+ gr_state
770
+ ],
771
+ outputs=[
772
+ processing_components['status_display'],
773
+ processing_components['original_preview'],
774
+ processing_components['markdown_output'],
775
+ processing_components['quick_metrics'],
776
+ gr_state
777
+ ],
778
+ show_progress="full"
779
+ )
780
+
781
+ # Clear session handler
782
+ processing_components['clear_btn'].click(
783
+ fn=self._handle_session_clear,
784
+ inputs=[gr_state],
785
+ outputs=[
786
+ processing_components['status_display'],
787
+ processing_components['original_preview'],
788
+ processing_components['markdown_output'],
789
+ processing_components['quick_metrics'],
790
+ gr_state
791
+ ]
792
+ )
793
+
794
+ # Analytics refresh handler
795
+ analytics_components['refresh_btn'].click(
796
+ fn=self._handle_analytics_refresh,
797
+ inputs=[gr_state],
798
+ outputs=[
799
+ analytics_components['quality_dashboard'],
800
+ analytics_components['analysis_summary'],
801
+ analytics_components['structure_metrics']
802
+ ]
803
+ )
804
+
805
+ async def _handle_document_processing(
806
+ self,
807
+ file_obj,
808
+ gemini_api_key: str,
809
+ analysis_type: str,
810
+ model_preference: str,
811
+ current_state: ApplicationState
812
+ ) -> Tuple[str, str, str, JSONDict, ApplicationState]:
813
+ """
814
+ Clean event handler - delegates to orchestrator
815
+
816
+ Strategic Design:
817
+ - Input Validation: Comprehensive request validation
818
+ - Business Logic Delegation: All processing logic in orchestrator
819
+ - Error Handling: User-friendly error presentation
820
+ - State Management: Immutable state updates
821
+ """
822
+
823
+ # Input validation
824
+ if not file_obj:
825
+ error_response = self.ui_factory.create_error_response(
826
+ "No file uploaded. Please select a document to process."
827
+ )
828
+ return (*error_response, current_state)
829
+
830
+ try:
831
+ # Extract file content and metadata
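+ # gr.File(type="binary") delivers raw bytes; file-like objects are also
+ # accepted so the handler can be invoked programmatically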
832
+ file_content = file_obj.read() if hasattr(file_obj, 'read') else file_obj
833
+ if isinstance(file_content, str):
834
+ file_content = file_content.encode('utf-8')
835
+
836
+ file_metadata = {
837
+ 'filename': getattr(file_obj, 'name', 'uploaded_file'),
838
+ 'size': len(file_content),
839
+ 'extension': Path(getattr(file_obj, 'name', 'file.txt')).suffix.lower(),
840
+ 'upload_timestamp': datetime.now().isoformat()
841
+ }
842
+
843
+ # Create processing request
844
+ processing_request = ProcessingRequest(
845
+ file_content=file_content,
846
+ file_metadata=file_metadata,
847
+ gemini_api_key=gemini_api_key.strip() if gemini_api_key else None,
848
+ analysis_type=analysis_type,
849
+ model_preference=model_preference,
850
+ session_context={'session_id': current_state.session_id}
851
+ )
852
+
853
+ # Execute processing through orchestrator
854
+ processing_response = await self.orchestrator.process_document(processing_request)
855
+
856
+ # Update application state
857
+ updated_state = current_state.add_processing_result(processing_response)
858
+
859
+ # Generate UI response
860
+ if processing_response.success:
861
+ ui_response = self.ui_factory.create_success_response(processing_response)
862
+ else:
863
+ ui_response = self.ui_factory.create_error_response(
864
+ processing_response.error_details,
865
+ processing_response.processing_metadata
866
+ )
867
+
868
+ return (*ui_response, updated_state)
869
+
870
+ except Exception as e:
871
+ logger.error(f"Event handler error: {str(e)}")
872
+ error_response = self.ui_factory.create_error_response(
873
+ f"System error during processing: {str(e)}"
874
+ )
875
+ return (*error_response, current_state)
876
+
877
+ def _handle_session_clear(
878
+ self,
879
+ current_state: ApplicationState
880
+ ) -> Tuple[str, str, str, JSONDict, ApplicationState]:
881
+ """Clear session with clean state reset"""
882
+
883
+ # Create fresh application state
884
+ fresh_state = ApplicationState(
885
+ session_id=datetime.now().isoformat()
886
+ )
887
+
888
+ # Clear UI components
889
+ clear_html = """
890
+ <div style="background: #e3f2fd; border: 1px solid #2196f3; color: #1976d2;
891
+ padding: 15px; border-radius: 8px; margin: 10px 0;">
892
+ <h4 style="margin: 0;">🔄 Session Cleared</h4>
893
+ <p style="margin: 5px 0 0 0;">Ready for new document processing.</p>
894
+ </div>
895
+ """
896
+
897
+ return (
898
+ clear_html,
899
+ "", # Clear preview
900
+ "", # Clear markdown
901
+ {}, # Clear metrics
902
+ fresh_state
903
+ )
904
+
905
+ def _handle_analytics_refresh(
906
+ self,
907
+ current_state: ApplicationState
908
+ ) -> Tuple[Any, str, JSONDict]:
909
+ """Refresh analytics dashboard with latest data"""
910
+
911
+ # Build the empty-state figure up front so every fallback path can reuse it
912
+ import plotly.graph_objects as go
913
+ empty_fig = go.Figure()
914
+ empty_fig.add_annotation(
915
+ x=0.5, y=0.5,
916
+ xref="paper", yref="paper",
917
+ text="Process documents to see analytics",
918
+ showarrow=False,
919
+ font=dict(size=16, color="gray")
920
+ )
921
+ empty_fig.update_layout(
922
+ title="Analytics Dashboard",
923
+ height=400
924
+ )
925
+
926
+ if not current_state.processing_history:
927
+ return (
928
+ empty_fig,
929
+ "*Process documents to see detailed analysis*",
930
+ {}
931
+ )
932
+
933
+ # Get latest successful processing result
934
+ latest_result = None
935
+ for result in reversed(current_state.processing_history):
936
+ if result.success:
937
+ latest_result = result
938
+ break
939
+
940
+ if not latest_result:
941
+ return (
942
+ empty_fig,
943
+ "*No successful processing results available*",
944
+ {}
945
+ )
946
+
947
+ try:
948
+ # Generate dashboard visualization
949
+ quality_dashboard = self.ui_factory.viz_engine.create_quality_dashboard(
950
+ latest_result.conversion_result,
951
+ latest_result.analysis_result
952
+ )
953
+
954
+ # Generate analysis summary
955
+ if latest_result.analysis_result:
956
+ analysis_summary = self._format_analysis_summary(latest_result.analysis_result)
957
+ else:
958
+ analysis_summary = "**Basic conversion completed.** Add Gemini API key for AI-powered analysis."
959
+
960
+ # Generate structure metrics
961
+ structure_metrics = latest_result.quality_metrics.get('structural_metrics', {})
962
+
963
+ return (
964
+ quality_dashboard,
965
+ analysis_summary,
966
+ structure_metrics
967
+ )
968
+
969
+ except Exception as e:
970
+ logger.error(f"Analytics refresh error: {str(e)}")
971
+ return (
972
+ empty_fig,
973
+ f"*Analytics refresh failed: {str(e)}*",
974
+ {"error": str(e)}
975
+ )
976
+
977
+ def _format_analysis_summary(self, analysis_result) -> str:
978
+ """Format AI analysis results for user presentation"""
979
+
980
+ if not analysis_result or not analysis_result.success:
981
+ return "*AI analysis not available*"
982
+
983
+ content = analysis_result.content
984
+ analysis_type = analysis_result.analysis_type.value.replace('_', ' ').title()
985
+
986
+ summary = f"## 🤖 {analysis_type}\n\n"
987
+ summary += f"**Model:** {analysis_result.model_used.value} \n"
988
+ summary += f"**Processing Time:** {analysis_result.processing_time:.2f}s\n\n"
989
+
990
+ # Extract key insights based on analysis type
991
+ if 'overall_score' in content:
992
+ summary += f"### 📊 Quality Assessment\n"
993
+ summary += f"**Overall Score:** {content.get('overall_score', 0)}/10\n\n"
994
+
995
+ scores = []
996
+ if 'structure_score' in content:
997
+ scores.append(f"Structure: {content['structure_score']}/10")
998
+ if 'completeness_score' in content:
999
+ scores.append(f"Completeness: {content['completeness_score']}/10")
1000
+ if 'accuracy_score' in content:
1001
+ scores.append(f"Accuracy: {content['accuracy_score']}/10")
1002
+
1003
+ if scores:
1004
+ summary += "**Detailed Scores:** " + " | ".join(scores) + "\n\n"
1005
+
1006
+ if 'executive_summary' in content:
1007
+ summary += f"### 📋 Executive Summary\n{content['executive_summary']}\n\n"
1008
+
1009
+ if 'detailed_feedback' in content:
1010
+ feedback = content['detailed_feedback'][:300]
1011
+ summary += f"### 💡 Key Insights\n{feedback}{'...' if len(content['detailed_feedback']) > 300 else ''}\n\n"
1012
+
1013
+ if 'recommendations' in content and content['recommendations']:
1014
+ summary += f"### 🎯 Recommendations\n"
1015
+ for i, rec in enumerate(content['recommendations'][:3], 1):
1016
+ summary += f"{i}. {rec}\n"
1017
+
1018
+ return summary
1019
+
1020
+ def _get_system_status(self) -> JSONDict:
1021
+ """Get comprehensive system status information"""
1022
+
1023
+ try:
1024
+ import psutil
1025
+ memory = psutil.virtual_memory()
1026
+ cpu_percent = psutil.cpu_percent(interval=1)
1027
+
1028
+ return {
1029
+ 'system': {
1030
+ 'status': 'Operational',
1031
+ 'cpu_usage_percent': cpu_percent,
1032
+ 'memory_usage_percent': memory.percent,
1033
+ 'available_memory_gb': round(memory.available / (1024**3), 2),
1034
+ 'platform': os.name
1035
+ },
1036
+ 'application': {
1037
+ 'version': self.config['version'],
1038
+ 'max_file_size_mb': self.config['max_file_size_mb'],
1039
+ 'supported_formats': len(self.config['supported_formats']),
1040
+ 'session_id': self.app_state.session_id
1041
+ },
1042
+ 'processing': self.orchestrator.get_processing_status()
1043
+ }
1044
+ except Exception as e:
1045
+ return {
1046
+ 'system': {'status': 'Unknown', 'error': str(e)},
1047
+ 'application': {'version': self.config['version']},
1048
+ 'processing': {'status': 'Unknown'}
1049
+ }
1050
+
1051
+
1052
+ # ==================== APPLICATION FACTORY & COMPOSITION ROOT ====================
1053
+
1054
+ class ApplicationFactory:
1055
+ """
1056
+ Strategic application composition - dependency injection container
1057
+
1058
+ Design Principles:
1059
+ - Composition Root: Single location for all dependency wiring
1060
+ - Environment Awareness: Different configurations for different environments
1061
+ - Component Lifecycle: Proper initialization order and cleanup
1062
+ - Configuration Management: Centralized configuration with validation
1063
+ """
1064
+
1065
+ @staticmethod
1066
+ def create_hf_spaces_app() -> MarkItDownTestingApp:
1067
+ """
1068
+ Factory method for HF Spaces optimized application
1069
+
1070
+ Optimizations:
1071
+ - Resource Management: Configured for 16GB memory limit
1072
+ - Processing Timeouts: Appropriate for shared infrastructure
1073
+ - Error Recovery: Graceful degradation under resource pressure
1074
+ - Logging Configuration: Production-appropriate logging levels
1075
+ """
1076
+
1077
+ logger.info("Initializing MarkItDown Testing Platform for HF Spaces deployment")
1078
+
1079
+ # Core configuration
1080
+ processing_config = ProcessingConfig(
1081
+ max_file_size_mb=50,
1082
+ max_memory_usage_gb=12.0,
1083
+ processing_timeout=300,
1084
+ max_concurrent_processes=2
1085
+ )
1086
+
1087
+ # Resource management
1088
+ resource_manager = ResourceManager(processing_config)
1089
+
1090
+ # Document processing components
1091
+ file_handler = StreamlineFileHandler(resource_manager)
1092
+ conversion_engine = HFConversionEngine(resource_manager, processing_config)
1093
+
1094
+ # AI analysis components
1095
+ gemini_manager = GeminiConnectionManager()
1096
+
1097
+ # Analytics and visualization
1098
+ viz_config = VisualizationConfig(
1099
+ theme=VisualizationConfig.VisualizationTheme.CORPORATE,
1100
+ width=800,
1101
+ height=600
1102
+ )
1103
+ viz_engine = InteractiveVisualizationEngine(viz_config)
1104
+ quality_calculator = QualityMetricsCalculator()
1105
+
1106
+ # Core orchestrator
1107
+ orchestrator = DocumentProcessingOrchestrator(
1108
+ file_handler=file_handler,
1109
+ conversion_engine=conversion_engine,
1110
+ gemini_manager=gemini_manager,
1111
+ viz_engine=viz_engine,
1112
+ quality_calculator=quality_calculator
1113
+ )
1114
+
1115
+ # UI presentation layer
1116
+ ui_factory = GradioResponseFactory(viz_engine)
1117
+
1118
+ # Application assembly
1119
+ app = MarkItDownTestingApp(
1120
+ orchestrator=orchestrator,
1121
+ ui_factory=ui_factory
1122
+ )
1123
+
1124
+ logger.info("Application initialized successfully - Ready for HF Spaces deployment")
1125
+ return app
1126
+
1127
+ @staticmethod
1128
+ def create_local_development_app() -> MarkItDownTestingApp:
1129
+ """Factory method for local development with enhanced debugging"""
1130
+
1131
+ # Enhanced configuration for local development
1132
+ processing_config = ProcessingConfig(
1133
+ max_file_size_mb=100,
1134
+ max_memory_usage_gb=32.0,
1135
+ processing_timeout=600,
1136
+ max_concurrent_processes=4
1137
+ )
1138
+
1139
+ # Enable debug logging for development
1140
+ logging.getLogger().setLevel(logging.DEBUG)
1141
+
1142
+         # Reuse the HF Spaces assembly for now; note the development-specific config above is not yet wired into it
1143
+ return ApplicationFactory.create_hf_spaces_app()
1144
+
1145
+
1146
+ # ==================== ENVIRONMENT SETUP & CONFIGURATION ====================
1147
+
1148
+ def setup_production_environment() -> None:
1149
+ """Configure production environment for optimal performance"""
1150
+
1151
+ # Environment variables for HF Spaces
1152
+ os.environ.setdefault('GRADIO_TEMP_DIR', '/tmp')
1153
+ os.environ.setdefault('HF_HOME', '/tmp')
1154
+ os.environ.setdefault('PYTHONUNBUFFERED', '1')
1155
+
1156
+ # Logging configuration
1157
+ logging.basicConfig(
1158
+ level=logging.INFO,
1159
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
1160
+ )
1161
+
1162
+ # System resource verification
1163
+ try:
1164
+ import psutil
1165
+ memory = psutil.virtual_memory()
1166
+ logger.info(f"Production environment initialized - Available memory: {memory.available / (1024**3):.2f} GB")
1167
+
1168
+ if memory.available < 2 * (1024**3): # Less than 2GB available
1169
+ logger.warning("Low memory detected - enabling aggressive cleanup policies")
1170
+
1171
+ except ImportError:
1172
+ logger.warning("psutil not available - resource monitoring disabled")
1173
+
1174
+
1175
+ def create_gradio_app() -> gr.Blocks:
1176
+ """
1177
+ Main application factory for Gradio deployment
1178
+
1179
+ This is the primary entry point for the application, designed to be called
1180
+ by Gradio's deployment infrastructure.
1181
+ """
1182
+
1183
+ setup_production_environment()
1184
+
1185
+ # Create application instance
1186
+ app = ApplicationFactory.create_hf_spaces_app()
1187
+
1188
+ # Create Gradio interface
1189
+ interface = app.create_interface()
1190
+
1191
+ return interface
1192
+
1193
+
1194
+ # ==================== MAIN ENTRY POINT ====================
1195
+
1196
+ def main():
1197
+ """
1198
+ Main application entry point for direct execution
1199
+
1200
+ Supports both development and production deployment modes with
1201
+ appropriate configuration for each environment.
1202
+ """
1203
+
1204
+ setup_production_environment()
1205
+
1206
+ # Create and configure application
1207
+ app = ApplicationFactory.create_hf_spaces_app()
1208
+ interface = app.create_interface()
1209
+
1210
+ # Launch configuration optimized for HF Spaces
1211
+ launch_kwargs = {
1212
+ 'server_name': '0.0.0.0',
1213
+ 'server_port': int(os.environ.get('PORT', 7860)),
1214
+ 'share': False, # HF Spaces handles sharing
1215
+ 'show_error': True,
1216
+         'max_file_size': "50mb",  # 50MB limit, using Gradio's size-string format
1217
+ 'allowed_paths': ['/tmp'],
1218
+ 'root_path': os.environ.get('GRADIO_ROOT_PATH', '')
1219
+ }
1220
+
1221
+ # Launch application
1222
+ try:
1223
+ logger.info(f"Launching MarkItDown Testing Platform on port {launch_kwargs['server_port']}")
1224
+ interface.launch(**launch_kwargs)
1225
+ except Exception as e:
1226
+ logger.error(f"Application launch failed: {str(e)}")
1227
+ raise
1228
+
1229
+
1230
+ # ==================== MODULE INTERFACE ====================
1231
+
1232
+ # Public API for external integration
1233
+ __all__ = [
1234
+ 'MarkItDownTestingApp',
1235
+ 'ApplicationFactory',
1236
+ 'ProcessingRequest',
1237
+ 'ProcessingResponse',
1238
+ 'create_gradio_app',
1239
+ 'main'
1240
+ ]
1241
+
1242
+
1243
+ if __name__ == "__main__":
1244
+ main()
core/modules.py ADDED
@@ -0,0 +1,416 @@
1
+ """
2
+ Enterprise-Grade Core Modules for MarkItDown Testing Platform
3
+
4
+ Strategic Design Philosophy:
5
+ - Stateless architecture for HF Spaces optimization
6
+ - Resource-aware processing with automatic cleanup
7
+ - Comprehensive error handling and recovery mechanisms
8
+ - Modular design enabling easy component replacement
9
+
10
+ This module implements the foundational processing layer with strict
11
+ separation of concerns and enterprise-grade error handling.
12
+ """
13
+
14
+ import asyncio
15
+ import tempfile
16
+ import shutil
17
+ import os
18
+ import gc
19
+ import json
20
+ import logging
21
+ from datetime import datetime
22
+ from pathlib import Path
23
+ from typing import Dict, Optional, List, Union, AsyncGenerator
24
+ from dataclasses import dataclass, asdict
25
+ from contextlib import asynccontextmanager
26
+
27
+ import aiofiles
28
+ from markitdown import MarkItDown
29
+ import google.generativeai as genai
30
+ from tenacity import retry, stop_after_attempt, wait_exponential
31
+ try:
32
+ import magic
33
+ except ImportError:
34
+ magic = None
35
+ import mimetypes
36
+ import psutil
37
+
38
+
39
+ # Shared JSON-serializable type alias
40
+ from pydantic import JsonValue
41
+
42
+ JSONDict = Dict[str, JsonValue]
43
+
44
+ # Strategic Configuration Management
45
+ @dataclass
46
+ class ProcessingConfig:
47
+ """Centralized configuration for processing parameters"""
48
+ max_file_size_mb: int = 50
49
+ max_memory_usage_gb: float = 12.0
50
+ temp_cleanup_interval: int = 300 # seconds
51
+ max_concurrent_processes: int = 3
52
+ processing_timeout: int = 300
53
+ gemini_timeout: int = 60
54
+ retry_attempts: int = 3
55
+
56
+
57
+ @dataclass
58
+ class ProcessingResult:
59
+ """Standardized result container for all processing operations"""
60
+ success: bool
61
+ content: str
62
+ metadata: JSONDict
63
+ error_message: Optional[str] = None
64
+ processing_time: Optional[float] = None
65
+ resource_usage: Optional[JSONDict] = None
66
+
67
+
68
+ class ResourceManager:
69
+ """
70
+ Enterprise-grade resource management for HF Spaces constraints
71
+
72
+ Strategic Approach:
73
+ - Proactive resource monitoring
74
+ - Automatic cleanup mechanisms
75
+ - Memory-efficient processing patterns
76
+ - Graceful degradation under resource pressure
77
+ """
78
+
79
+ def __init__(self, config: ProcessingConfig):
80
+ self.config = config
81
+ self.active_processes = set()
82
+ self.temp_directories = set()
83
+
84
+ def check_resource_availability(self, file_size_bytes: int) -> bool:
85
+ """Validate resource availability before processing"""
86
+
87
+ # Convert bytes to MB for comparison
88
+ file_size_mb = file_size_bytes / (1024 * 1024)
89
+
90
+ if file_size_mb > self.config.max_file_size_mb:
91
+ raise ResourceError(
92
+ f"File size {file_size_mb:.2f}MB exceeds limit {self.config.max_file_size_mb}MB"
93
+ )
94
+
95
+ memory_info = psutil.virtual_memory()
96
+ process_memory_gb = psutil.Process(os.getpid()).memory_info().rss / (1024**3)
97
+
98
+ if process_memory_gb > self.config.max_memory_usage_gb:
99
+ raise ResourceError(
100
+ f"Process memory usage {process_memory_gb:.2f}GB exceeds limit {self.config.max_memory_usage_gb:.2f}GB"
101
+ )
102
+
103
+ available_gb = memory_info.available / (1024**3)
104
+ if available_gb < 1.0:
105
+ raise ResourceError(
106
+ f"Low system memory available: {available_gb:.2f}GB"
107
+ )
108
+
109
+ if len(self.active_processes) >= self.config.max_concurrent_processes:
110
+ raise ResourceError("Maximum concurrent processes exceeded")
111
+
112
+ return True
113
+
114
+ @asynccontextmanager
115
+ async def managed_temp_directory(self):
116
+ """Context manager for temporary directory with automatic cleanup"""
117
+ temp_dir = tempfile.mkdtemp(prefix="markitdown_")
118
+ self.temp_directories.add(temp_dir)
119
+
120
+ try:
121
+ yield temp_dir
122
+ finally:
123
+ await self._cleanup_directory(temp_dir)
124
+ self.temp_directories.discard(temp_dir)
125
+
126
+ async def _cleanup_directory(self, directory: str):
127
+ """Async cleanup of temporary directory"""
128
+ try:
129
+ if os.path.exists(directory):
130
+ shutil.rmtree(directory, ignore_errors=True)
131
+ except Exception as e:
132
+ logging.warning(f"Cleanup warning for {directory}: {e}")
133
+
134
+ async def force_cleanup(self):
135
+ """Emergency cleanup of all managed resources"""
136
+ cleanup_tasks = [
137
+ self._cleanup_directory(temp_dir)
138
+ for temp_dir in list(self.temp_directories)
139
+ ]
140
+
141
+ if cleanup_tasks:
142
+ await asyncio.gather(*cleanup_tasks, return_exceptions=True)
143
+
144
+ # Force garbage collection
145
+ gc.collect()
146
+
147
+ self.temp_directories.clear()
148
+
149
+
150
+ class StreamlineFileHandler:
151
+ """
152
+ Memory-efficient file processing optimized for HF Spaces
153
+
154
+ Key Design Principles:
155
+ - Stream-based processing to minimize memory footprint
156
+ - Comprehensive file validation and security checks
157
+ - Automatic format detection and metadata extraction
158
+ - Graceful error handling with detailed diagnostics
159
+ """
160
+
161
+ def __init__(self, resource_manager: ResourceManager):
162
+ self.resource_manager = resource_manager
163
+ self.supported_formats = {
164
+ '.pdf', '.docx', '.pptx', '.xlsx', '.txt',
165
+ '.html', '.htm', '.csv', '.json', '.xml', '.rtf'
166
+ }
167
+
168
+ async def process_upload(self, file_obj, metadata_override: Optional[JSONDict] = None) -> ProcessingResult:
169
+ """Process uploaded file with comprehensive validation"""
170
+
171
+ start_time = datetime.now()
172
+
173
+ try:
174
+ # Extract basic file information
175
+ file_info = self._extract_file_metadata(file_obj)
176
+
177
+ if metadata_override:
178
+ # Merge provided metadata, prioritising supplied values
179
+ for key, value in metadata_override.items():
180
+ if value in (None, ""):
181
+ continue
182
+ file_info[key] = value
183
+
184
+ # Recalculate support flag using final extension
185
+ extension = file_info.get('extension', '').lower()
186
+ if extension:
187
+ if not extension.startswith('.'):
188
+ extension = f'.{extension}'
189
+ file_info['extension'] = extension
190
+ file_info['supported'] = file_info.get('extension') in self.supported_formats
191
+
192
+ # Resource availability check
193
+ self.resource_manager.check_resource_availability(file_info['size'])
194
+
195
+ # Security validation
196
+ await self._validate_file_security(file_obj, file_info)
197
+
198
+ # Read file content efficiently
199
+ content = await self._read_file_content(file_obj)
200
+
201
+ processing_time = (datetime.now() - start_time).total_seconds()
202
+
203
+ return ProcessingResult(
204
+ success=True,
205
+ content=content,
206
+ metadata=file_info,
207
+ processing_time=processing_time,
208
+ resource_usage=self._get_current_resource_usage()
209
+ )
210
+
211
+ except Exception as e:
212
+ return ProcessingResult(
213
+ success=False,
214
+ content="",
215
+ metadata={},
216
+ error_message=str(e),
217
+ processing_time=(datetime.now() - start_time).total_seconds()
218
+ )
219
+
220
+ def _extract_file_metadata(self, file_obj) -> JSONDict:
221
+ """Extract comprehensive file metadata"""
222
+
223
+ file_path = Path(file_obj.name) if hasattr(file_obj, 'name') else Path("unknown")
224
+
225
+ return {
226
+ 'filename': file_path.name,
227
+ 'extension': file_path.suffix.lower(),
228
+ 'size': getattr(file_obj, 'size', 0),
229
+ 'mime_type': self._detect_mime_type(file_obj),
230
+ 'timestamp': datetime.now().isoformat(),
231
+ 'supported': file_path.suffix.lower() in self.supported_formats
232
+ }
233
+
234
+ def _detect_mime_type(self, file_obj) -> str:
235
+ """Detect MIME type using python-magic if available"""
236
+ mime_type = None
237
+
238
+ if magic is not None and hasattr(file_obj, 'read'):
239
+ try:
240
+ current_pos = file_obj.tell() if hasattr(file_obj, 'tell') else 0
241
+ chunk = file_obj.read(1024)
242
+ if hasattr(file_obj, 'seek'):
243
+ file_obj.seek(current_pos)
244
+ mime_type = magic.from_buffer(chunk, mime=True) if chunk else None
245
+ except Exception:
246
+ mime_type = None
247
+
248
+ if not mime_type:
249
+ filename = getattr(file_obj, 'name', None)
250
+ if filename:
251
+ mime_type = mimetypes.guess_type(filename)[0]
252
+
253
+ return mime_type or 'application/octet-stream'
254
+
255
+ async def _validate_file_security(self, file_obj, file_info: JSONDict):
256
+ """Comprehensive security validation"""
257
+
258
+ # File extension validation
259
+ if not file_info['supported']:
260
+ raise SecurityError(f"Unsupported file format: {file_info['extension']}")
261
+
262
+ # MIME type consistency check
263
+ expected_mimes = {
264
+ '.pdf': 'application/pdf',
265
+ '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
266
+ '.txt': 'text/plain',
267
+ '.html': 'text/html',
268
+ '.htm': 'text/html'
269
+ }
270
+
271
+ expected_mime = expected_mimes.get(file_info['extension'])
272
+ if expected_mime and not file_info['mime_type'].startswith(expected_mime.split('/')[0]):
273
+ logging.warning(f"MIME type mismatch for {file_info['extension']}")
274
+
275
+ async def _read_file_content(self, file_obj) -> bytes:
276
+ """Memory-efficient file content reading"""
277
+
278
+ if hasattr(file_obj, 'read'):
279
+ # Reset to beginning if possible
280
+ if hasattr(file_obj, 'seek'):
281
+ file_obj.seek(0)
282
+ return file_obj.read()
283
+
284
+ # Handle different file object types
285
+ if hasattr(file_obj, 'file'):
286
+ return file_obj.file.read()
287
+
288
+ raise ValueError("Unable to read file content")
289
+
290
+ def _get_current_resource_usage(self) -> JSONDict:
291
+ """Get current system resource usage"""
292
+
293
+ memory_info = psutil.virtual_memory()
294
+ cpu_percent = psutil.cpu_percent(interval=0.1)
295
+
296
+ return {
297
+ 'memory_used_gb': memory_info.used / (1024**3),
298
+ 'memory_available_gb': memory_info.available / (1024**3),
299
+ 'cpu_percent': cpu_percent,
300
+ 'timestamp': datetime.now().isoformat()
301
+ }
302
+
303
+
304
+ class HFConversionEngine:
305
+ """
306
+ MarkItDown wrapper optimized for stateless HF Spaces execution
307
+
308
+ Strategic Design Features:
309
+ - Async processing with progress tracking
310
+ - Automatic resource cleanup and memory management
311
+ - Comprehensive error handling and retry mechanisms
312
+ - Performance monitoring and optimization
313
+ """
314
+
315
+ def __init__(self, resource_manager: ResourceManager, config: ProcessingConfig):
316
+ self.resource_manager = resource_manager
317
+ self.config = config
318
+ self.md = MarkItDown()
319
+
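+ # Transient-failure guard: tenacity retries the whole conversion, up to 3 attempts with 4-10 s exponential backoff.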
320
+ @retry(
321
+ stop=stop_after_attempt(3),
322
+ wait=wait_exponential(multiplier=1, min=4, max=10)
323
+ )
324
+ async def convert_stream(self, file_content: bytes, file_metadata: JSONDict) -> ProcessingResult:
325
+ """Stream-based conversion with automatic cleanup and retry logic"""
326
+
327
+ start_time = datetime.now()
328
+ process_id = id(asyncio.current_task())
329
+ self.resource_manager.active_processes.add(process_id)
330
+
331
+ try:
332
+ async with self.resource_manager.managed_temp_directory() as temp_dir:
333
+ # Create temporary file for MarkItDown processing
334
+ temp_file_path = await self._create_temp_file(
335
+ temp_dir, file_content, file_metadata
336
+ )
337
+
338
+ # Perform conversion with timeout
339
+ result = await asyncio.wait_for(
340
+ self._execute_conversion(temp_file_path),
341
+ timeout=self.config.gemini_timeout
342
+ )
343
+
344
+ processing_time = (datetime.now() - start_time).total_seconds()
345
+
346
+ return ProcessingResult(
347
+ success=True,
348
+ content=result.text_content,
349
+ metadata={
350
+ 'original_file': file_metadata,
351
+ 'conversion_time': processing_time,
352
+ 'content_length': len(result.text_content),
353
+ 'conversion_metadata': self._extract_conversion_metadata(result)
354
+ },
355
+ processing_time=processing_time
356
+ )
357
+
358
+ except Exception as e:
359
+ return ProcessingResult(
360
+ success=False,
361
+ content="",
362
+ metadata=file_metadata,
363
+ error_message=f"Conversion failed: {str(e)}",
364
+ processing_time=(datetime.now() - start_time).total_seconds()
365
+ )
366
+
367
+ finally:
368
+ self.resource_manager.active_processes.discard(process_id)
369
+
370
+ async def _create_temp_file(self, temp_dir: str, content: bytes, metadata: JSONDict) -> str:
371
+ """Create temporary file for processing"""
372
+
373
+ filename = metadata.get('filename', 'temp_file')
374
+ temp_file_path = os.path.join(temp_dir, filename)
375
+
376
+ async with aiofiles.open(temp_file_path, 'wb') as temp_file:
377
+ await temp_file.write(content)
378
+
379
+ return temp_file_path
380
+
381
+ async def _execute_conversion(self, file_path: str):
382
+ """Execute MarkItDown conversion in thread pool"""
383
+
384
+ loop = asyncio.get_running_loop()
385
+ return await loop.run_in_executor(
386
+ None, self.md.convert, file_path
387
+ )
388
+
389
+ def _extract_conversion_metadata(self, result) -> JSONDict:
390
+ """Extract metadata from MarkItDown result"""
391
+
392
+ content = result.text_content
393
+
394
+ return {
395
+ 'lines_count': len(content.split('\n')),
396
+ 'word_count': len(content.split()),
397
+ 'character_count': len(content),
398
+ 'has_tables': '|' in content,
399
+ 'has_headers': content.count('#') > 0,
400
+ 'has_lists': content.count('- ') > 0 or content.count('* ') > 0,
401
+ 'has_links': '[' in content and '](' in content
402
+ }
403
+
404
+
405
+ # Custom Exception Classes
406
+ class ResourceError(Exception):
407
+ """Resource constraint violation"""
408
+ pass
409
+
410
+ class SecurityError(Exception):
411
+ """Security validation failure"""
412
+ pass
413
+
414
+ class ConversionError(Exception):
415
+ """Document conversion failure"""
416
+ pass
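A minimal usage sketch for the module above (an editorial illustration, not part of the commit): it mirrors the wiring that `PlatformTester` in `examples/usage_examples.py` uses below, and assumes the platform's dependencies (e.g. `markitdown`, `aiofiles`, `psutil`, `tenacity`) are installed. `UploadStub` is a hypothetical stand-in for the duck-typed upload object that `process_upload` expects — anything exposing `name`, `size`, and `read()`.

```python
import asyncio

from core.modules import (
    ProcessingConfig, ResourceManager, StreamlineFileHandler, HFConversionEngine
)


class UploadStub:
    """Hypothetical stand-in for a Gradio-style upload object."""

    def __init__(self, path: str, data: bytes):
        self.name = path          # used for extension and MIME detection
        self.size = len(data)     # checked against resource limits
        self._data = data

    def read(self) -> bytes:
        return self._data


async def convert(path: str, data: bytes) -> str:
    """Validate an upload, then convert it to Markdown."""
    config = ProcessingConfig()
    resource_manager = ResourceManager(config)
    handler = StreamlineFileHandler(resource_manager)
    engine = HFConversionEngine(resource_manager, config)

    upload = UploadStub(path, data)
    file_result = await handler.process_upload(upload)
    if not file_result.success:
        raise RuntimeError(file_result.error_message)

    conversion = await engine.convert_stream(data, file_result.metadata)
    if not conversion.success:
        raise RuntimeError(conversion.error_message)
    return conversion.content


# Example: print(asyncio.run(convert("report.html", b"<h1>Hello</h1>")))
```

Note that `convert_stream` takes the raw bytes plus the metadata returned by `process_upload`; the engine writes them to a managed temp directory itself, so nothing persists between calls — the property that makes it suitable for stateless HF Spaces execution.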
examples/usage_examples.py ADDED
@@ -0,0 +1,1159 @@
1
+ """
2
+ MarkItDown Testing Platform - Usage Examples and Testing Suite
3
+
4
+ This module provides comprehensive examples and testing capabilities for the
5
+ MarkItDown Testing Platform, demonstrating various use cases and validation scenarios.
6
+
7
+ Strategic Examples Coverage:
8
+ - Basic document conversion workflows
9
+ - Advanced AI analysis integration
10
+ - Performance benchmarking and optimization
11
+ - Enterprise integration patterns
12
+ - Error handling and recovery scenarios
13
+ """
14
+
15
+ import asyncio
16
+ import tempfile
17
+ import json
18
+ import time
19
+ from datetime import datetime
20
+ from pathlib import Path
21
+ from typing import Dict, List, Any, Optional, Tuple
22
+ import logging
23
+
24
+ # Import platform components
25
+ from core.modules import (
26
+ StreamlineFileHandler, HFConversionEngine, ResourceManager,
27
+ ProcessingConfig, ProcessingResult
28
+ )
29
+ from llm.gemini_connector import (
30
+ GeminiAnalysisEngine, GeminiConfig, AnalysisRequest,
31
+ AnalysisType, GeminiModel, create_analysis_request
32
+ )
33
+ from visualization.analytics_engine import (
34
+ InteractiveVisualizationEngine, QualityMetricsCalculator,
35
+ VisualizationConfig
36
+ )
37
+
38
+ # Configure logging
39
+ logging.basicConfig(level=logging.INFO)
40
+ logger = logging.getLogger(__name__)
41
+
42
+
43
+ class DocumentSampleGenerator:
44
+ """Generate test documents for comprehensive platform testing"""
45
+
46
+ @staticmethod
47
+ def create_test_html() -> str:
48
+ """Create comprehensive HTML test document"""
49
+
50
+ return """
51
+ <!DOCTYPE html>
52
+ <html lang="en">
53
+ <head>
54
+ <meta charset="UTF-8">
55
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
56
+ <title>MarkItDown Test Document</title>
57
+ <style>
58
+ .highlight { background-color: yellow; }
59
+ .important { font-weight: bold; color: red; }
60
+ </style>
61
+ </head>
62
+ <body>
63
+ <h1>Enterprise Document Conversion Test</h1>
64
+ <p class="important">This is a comprehensive test document for MarkItDown platform validation.</p>
65
+
66
+ <h2>Document Structure Testing</h2>
67
+ <p>This section tests various structural elements and their conversion accuracy.</p>
68
+
69
+ <h3>List Testing</h3>
70
+ <h4>Unordered Lists</h4>
71
+ <ul>
72
+ <li>Primary list item with <strong>bold text</strong></li>
73
+ <li>Secondary item with <em>italic formatting</em></li>
74
+ <li>Nested list testing:
75
+ <ul>
76
+ <li>Nested item 1</li>
77
+ <li>Nested item 2 with <a href="https://example.com">external link</a></li>
78
+ </ul>
79
+ </li>
80
+ <li>Code reference: <code>function processDocument()</code></li>
81
+ </ul>
82
+
83
+ <h4>Ordered Lists</h4>
84
+ <ol>
85
+ <li>First priority task</li>
86
+ <li>Second priority with emphasis: <span class="highlight">critical deadline</span></li>
87
+ <li>Third priority item</li>
88
+ </ol>
89
+
90
+ <h3>Table Structure Testing</h3>
91
+ <table border="1" style="border-collapse: collapse; width: 100%;">
92
+ <thead>
93
+ <tr style="background-color: #f2f2f2;">
94
+ <th>Feature</th>
95
+ <th>Status</th>
96
+ <th>Priority</th>
97
+ <th>Notes</th>
98
+ </tr>
99
+ </thead>
100
+ <tbody>
101
+ <tr>
102
+ <td>Document Conversion</td>
103
+ <td>✅ Complete</td>
104
+ <td>High</td>
105
+ <td>Core functionality working</td>
106
+ </tr>
107
+ <tr>
108
+ <td>AI Analysis</td>
109
+ <td>🔄 In Progress</td>
110
+ <td>High</td>
111
+ <td>Gemini integration active</td>
112
+ </tr>
113
+ <tr>
114
+ <td>Visualization</td>
115
+ <td>✅ Complete</td>
116
+ <td>Medium</td>
117
+ <td>Interactive dashboards ready</td>
118
+ </tr>
119
+ <tr>
120
+ <td>Export Features</td>
121
+ <td>⏳ Planned</td>
122
+ <td>Low</td>
123
+ <td>Multiple format support</td>
124
+ </tr>
125
+ </tbody>
126
+ </table>
127
+
128
+ <h3>Code Block Testing</h3>
129
+ <p>Example Python integration code:</p>
130
+ <pre><code>
131
+ from markitdown import MarkItDown
132
+ from gemini_connector import GeminiAnalysisEngine
133
+
134
+ async def process_document(file_path, api_key):
135
+ # Initialize components
136
+ md = MarkItDown()
137
+ gemini = GeminiAnalysisEngine(api_key)
138
+
139
+ # Convert document
140
+ result = md.convert(file_path)
141
+
142
+ # Analyze with AI
143
+ analysis = await gemini.analyze_content(result.text_content)
144
+
145
+ return result, analysis
146
+ </code></pre>
147
+
148
+ <h3>Link and Reference Testing</h3>
149
+ <p>This section contains various types of references:</p>
150
+ <ul>
151
+ <li>External link: <a href="https://github.com/microsoft/markitdown">Microsoft MarkItDown Repository</a></li>
152
+ <li>Email reference: <a href="mailto:[email protected]">Technical Support</a></li>
153
+ <li>Internal reference: <a href="#document-structure-testing">Jump to Structure Section</a></li>
154
+ <li>Document reference: See the <a href="./documentation.pdf">full documentation</a> for details</li>
155
+ </ul>
156
+
157
+ <h3>Special Formatting Testing</h3>
158
+ <div>
159
+ <p><strong>Bold text emphasis</strong> and <em>italic styling</em> combined with <u>underlined content</u>.</p>
160
+ <p><del>Strikethrough text</del> and <mark>highlighted content</mark> for attention.</p>
161
+ <p>Mathematical notation: E = mc<sup>2</sup> and chemical formula: H<sub>2</sub>O.</p>
162
+ </div>
163
+
164
+ <h2>Content Quality Assessment</h2>
165
+ <blockquote style="border-left: 4px solid #ccc; padding-left: 16px; font-style: italic;">
166
+ "Quality is not an act, it is a habit. The systematic approach to document conversion
167
+ and analysis ensures consistent, reliable results across diverse content types and formats."
168
+ </blockquote>
169
+
170
+ <h3>Technical Specifications</h3>
171
+ <div style="background-color: #f9f9f9; padding: 15px; border-radius: 5px;">
172
+ <h4>Processing Requirements:</h4>
173
+ <ul>
174
+ <li><strong>Maximum File Size:</strong> 50MB (HF Spaces limit)</li>
175
+ <li><strong>Supported Formats:</strong> PDF, DOCX, PPTX, XLSX, HTML, TXT, CSV, JSON, XML</li>
176
+ <li><strong>Processing Timeout:</strong> 5 minutes maximum</li>
177
+ <li><strong>Memory Usage:</strong> Optimized for 16GB constraint</li>
178
+ </ul>
179
+ </div>
180
+
181
+ <h2>Integration Examples</h2>
182
+ <p>The following examples demonstrate enterprise integration patterns:</p>
183
+
184
+ <h3>Batch Processing Workflow</h3>
185
+ <ol>
186
+ <li>Document ingestion from multiple sources</li>
187
+ <li>Automated quality validation pipeline</li>
188
+ <li>AI-powered content analysis and enhancement</li>
189
+ <li>Structured output generation for downstream systems</li>
190
+ <li>Comprehensive reporting and analytics</li>
191
+ </ol>
192
+
193
+ <footer style="margin-top: 50px; padding-top: 20px; border-top: 1px solid #ccc;">
194
+ <p><em>Generated for MarkItDown Testing Platform - Version 1.0.0</em></p>
195
+ <p><strong>Document ID:</strong> TEST-DOC-001 | <strong>Created:</strong> {timestamp}</p>
196
+ </footer>
197
+ </body>
198
+ </html>
199
+ """.replace('{timestamp}', datetime.now().isoformat())
200
+
201
+ @staticmethod
202
+ def create_test_json() -> str:
203
+ """Create structured JSON test data"""
204
+
205
+ return json.dumps({
206
+ "document_metadata": {
207
+ "title": "MarkItDown Test Configuration",
208
+ "version": "1.0.0",
209
+ "created": datetime.now().isoformat(),
210
+ "description": "Comprehensive test data for platform validation"
211
+ },
212
+ "processing_config": {
213
+ "max_file_size_mb": 50,
214
+ "timeout_seconds": 300,
215
+ "supported_formats": [
216
+ "pdf", "docx", "pptx", "xlsx",
217
+ "html", "txt", "csv", "json", "xml"
218
+ ],
219
+ "ai_analysis": {
220
+ "enabled": True,
221
+ "models": ["gemini-1.5-pro", "gemini-1.5-flash"],
222
+ "analysis_types": [
223
+ "quality_analysis",
224
+ "structure_review",
225
+ "content_summary",
226
+ "extraction_quality"
227
+ ]
228
+ }
229
+ },
230
+ "test_scenarios": [
231
+ {
232
+ "name": "Basic Document Conversion",
233
+ "description": "Test core MarkItDown functionality",
234
+ "expected_elements": [
235
+ "headers", "paragraphs", "lists", "tables", "links"
236
+ ],
237
+ "quality_threshold": 7.0
238
+ },
239
+ {
240
+ "name": "AI Analysis Integration",
241
+ "description": "Test Gemini API integration",
242
+ "required_api_key": True,
243
+ "expected_analysis": [
244
+ "overall_score", "detailed_feedback", "recommendations"
245
+ ],
246
+ "quality_threshold": 8.0
247
+ },
248
+ {
249
+ "name": "Performance Benchmarking",
250
+ "description": "Test processing speed and resource usage",
251
+ "metrics": [
252
+ "processing_time", "memory_usage", "cpu_utilization"
253
+ ],
254
+ "performance_threshold": {
255
+ "processing_time_seconds": 60,
256
+ "memory_usage_mb": 1000
257
+ }
258
+ }
259
+ ],
260
+ "quality_metrics": {
261
+ "structural_integrity": {
262
+ "weight": 0.3,
263
+ "components": ["headers", "lists", "tables", "formatting"]
264
+ },
265
+ "content_preservation": {
266
+ "weight": 0.25,
267
+ "components": ["text_accuracy", "link_preservation", "data_integrity"]
268
+ },
269
+ "ai_analysis_quality": {
270
+ "weight": 0.25,
271
+ "components": ["insight_depth", "recommendation_quality", "accuracy"]
272
+ },
273
+ "processing_efficiency": {
274
+ "weight": 0.2,
275
+ "components": ["speed", "resource_usage", "reliability"]
276
+ }
277
+ },
278
+ "expected_outputs": {
279
+ "markdown_conversion": {
280
+ "min_length": 1000,
281
+ "required_elements": ["# ", "## ", "- ", "| "],
282
+ "quality_indicators": ["proper_escaping", "structure_preservation"]
283
+ },
284
+ "ai_analysis": {
285
+ "required_fields": ["overall_score", "detailed_feedback"],
286
+ "score_range": [0, 10],
287
+ "feedback_min_length": 100
288
+ },
289
+ "visualization": {
290
+ "chart_types": ["radar", "bar", "treemap", "line"],
291
+ "interactive_elements": True,
292
+ "export_formats": ["html", "png", "svg"]
293
+ }
294
+ }
295
+ }, indent=2)
296
+
297
+ @staticmethod
298
+ def create_test_csv() -> str:
299
+ """Create CSV test data with various data types"""
300
+
301
+ return """Name,Age,Department,Salary,Join Date,Performance Rating,Notes
302
+ John Smith,34,Engineering,75000,2023-01-15,4.5,"Excellent problem solver, team lead"
303
+ Maria Garcia,28,Marketing,62000,2023-03-20,4.2,"Creative campaigns, social media expert"
304
+ David Chen,41,Finance,82000,2022-08-10,4.8,"CPA certified, process optimization"
305
+ Sarah Johnson,29,Engineering,68000,2023-02-28,4.3,"Full-stack developer, agile advocate"
306
+ Michael Brown,36,Sales,71000,2022-11-05,4.6,"Top performer, client relationship expert"
307
+ Lisa Wang,32,Product,78000,2023-01-08,4.4,"UX specialist, user research focused"
308
+ Robert Davis,45,Operations,69000,2022-07-22,4.1,"Supply chain optimization, vendor management"
309
+ Jennifer Wilson,33,HR,59000,2023-04-12,4.3,"Talent acquisition, employee engagement"
310
+ James Anderson,38,Engineering,81000,2022-09-18,4.7,"Senior architect, technical mentoring"
311
+ Emily Taylor,27,Marketing,57000,2023-05-01,4.0,"Digital marketing, content strategy"
312
+ """
313
+
314
+
315
+ class PlatformTester:
316
+ """Comprehensive testing suite for the MarkItDown Testing Platform"""
317
+
318
+ def __init__(self):
319
+ # Initialize platform components
320
+ self.processing_config = ProcessingConfig()
321
+ self.resource_manager = ResourceManager(self.processing_config)
322
+ self.file_handler = StreamlineFileHandler(self.resource_manager)
323
+ self.conversion_engine = HFConversionEngine(self.resource_manager, self.processing_config)
324
+ self.viz_engine = InteractiveVisualizationEngine()
325
+ self.quality_calculator = QualityMetricsCalculator()
326
+
327
+ # Test results storage
328
+ self.test_results = []
329
+ self.performance_metrics = []
330
+
331
+ async def run_basic_conversion_test(self) -> Dict[str, Any]:
332
+ """Test basic document conversion functionality"""
333
+
334
+ logger.info("Running basic conversion test...")
335
+
336
+ test_start = time.time()
337
+
338
+ try:
339
+ # Create test HTML document
340
+ html_content = DocumentSampleGenerator.create_test_html()
341
+
342
+ # Create temporary file
343
+ with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as temp_file:
344
+ temp_file.write(html_content)
345
+ temp_file_path = temp_file.name
346
+
347
+ # Simulate file upload
348
+ class MockFile:
349
+ def __init__(self, path):
350
+ self.name = path
351
+ with open(path, 'rb') as f:
352
+ self.content = f.read()
353
+ self.size = len(self.content)
354
+
355
+ def read(self):
356
+ return self.content
357
+
358
+ mock_file = MockFile(temp_file_path)
359
+
360
+ # Process file
361
+ file_result = await self.file_handler.process_upload(mock_file)
362
+
363
+ if not file_result.success:
364
+ return {
365
+ 'test_name': 'basic_conversion',
366
+ 'status': 'failed',
367
+ 'error': file_result.error_message,
368
+ 'duration': time.time() - test_start
369
+ }
370
+
371
+ # Convert document
372
+ conversion_result = await self.conversion_engine.convert_stream(
373
+ mock_file.content, file_result.metadata
374
+ )
375
+
376
+ if not conversion_result.success:
377
+ return {
378
+ 'test_name': 'basic_conversion',
379
+ 'status': 'failed',
380
+ 'error': conversion_result.error_message,
381
+ 'duration': time.time() - test_start
382
+ }
383
+
384
+ # Validate conversion results
385
+ validation_results = self._validate_conversion_output(conversion_result)
386
+
387
+ # Calculate quality metrics
388
+ quality_metrics = self.quality_calculator.calculate_conversion_quality_metrics(
389
+ conversion_result
390
+ )
391
+
392
+ test_duration = time.time() - test_start
393
+
394
+ # Clean up
395
+ Path(temp_file_path).unlink(missing_ok=True)
396
+
397
+ return {
398
+ 'test_name': 'basic_conversion',
399
+ 'status': 'passed',
400
+ 'duration': test_duration,
401
+ 'validation': validation_results,
402
+ 'quality_metrics': quality_metrics,
403
+ 'performance': {
404
+ 'processing_time': conversion_result.processing_time,
405
+ 'content_length': len(conversion_result.content),
406
+ 'throughput': len(conversion_result.content) / test_duration
407
+ }
408
+ }
409
+
410
+ except Exception as e:
411
+ return {
412
+ 'test_name': 'basic_conversion',
413
+ 'status': 'error',
414
+ 'error': str(e),
415
+ 'duration': time.time() - test_start
416
+ }
417
+
418
+ async def run_ai_analysis_test(self, gemini_api_key: str) -> Dict[str, Any]:
419
+ """Test AI analysis integration with Gemini"""
420
+
421
+ logger.info("Running AI analysis test...")
422
+
423
+ if not gemini_api_key:
424
+ return {
425
+ 'test_name': 'ai_analysis',
426
+ 'status': 'skipped',
427
+ 'reason': 'No API key provided'
428
+ }
429
+
430
+ test_start = time.time()
431
+
432
+ try:
433
+ # Create Gemini engine
434
+ gemini_config = GeminiConfig(api_key=gemini_api_key)
435
+ gemini_engine = GeminiAnalysisEngine(gemini_config)
436
+
437
+ # Create test content
438
+ test_content = """
439
+ # Test Document for AI Analysis
440
+
441
+ This is a comprehensive test document designed to evaluate the AI analysis capabilities
442
+ of the MarkItDown Testing Platform.
443
+
444
+ ## Document Structure
445
+
446
+ ### Headers and Organization
447
+ This document contains multiple heading levels to test structure recognition.
448
+
449
+ ### Content Quality
450
+ The content includes various elements:
451
+ - Technical terminology and concepts
452
+ - Business-oriented language and metrics
453
+ - Complex sentence structures
454
+ - Tables and structured data
455
+
456
+ | Metric | Value | Status |
457
+ |--------|-------|--------|
458
+ | Conversion Quality | 8.5/10 | Excellent |
459
+ | Processing Speed | 2.3s | Good |
460
+ | Resource Usage | 45% | Optimal |
461
+
462
+ ## Analysis Requirements
463
+
464
+ This content should trigger comprehensive analysis covering:
465
+ 1. **Structure Assessment**: Header hierarchy and organization
466
+ 2. **Content Quality**: Information density and clarity
467
+ 3. **Technical Accuracy**: Preservation of data and formatting
468
+ 4. **Readability**: AI-friendly output optimization
469
+
470
+ The analysis should provide actionable insights and recommendations
471
+ for improving document conversion processes.
472
+ """
473
+
474
+ # Test different analysis types
475
+ analysis_types = [
476
+ AnalysisType.QUALITY_ANALYSIS,
477
+ AnalysisType.STRUCTURE_REVIEW,
478
+ AnalysisType.CONTENT_SUMMARY
479
+ ]
480
+
481
+ analysis_results = {}
482
+
483
+ for analysis_type in analysis_types:
484
+ analysis_request = AnalysisRequest(
485
+ content=test_content,
486
+ analysis_type=analysis_type,
487
+ model=GeminiModel.PRO
488
+ )
489
+
490
+ result = await gemini_engine.analyze_content(analysis_request)
491
+ analysis_results[analysis_type.value] = {
492
+ 'success': result.success,
493
+ 'processing_time': result.processing_time,
494
+ 'content_length': len(str(result.content)) if result.success else 0,
495
+ 'error': result.error_message if not result.success else None
496
+ }
497
+
498
+ test_duration = time.time() - test_start
499
+
500
+ # Calculate success rate
501
+ successful_analyses = sum(1 for r in analysis_results.values() if r['success'])
502
+ success_rate = successful_analyses / len(analysis_types) * 100
503
+
504
+ return {
505
+ 'test_name': 'ai_analysis',
506
+ 'status': 'passed' if success_rate > 0 else 'failed',
507
+ 'duration': test_duration,
508
+ 'success_rate': success_rate,
509
+ 'analysis_results': analysis_results,
510
+ 'performance_metrics': gemini_engine.get_performance_metrics()
511
+ }
512
+
513
+ except Exception as e:
514
+ return {
515
+ 'test_name': 'ai_analysis',
516
+ 'status': 'error',
517
+ 'error': str(e),
518
+ 'duration': time.time() - test_start
519
+ }
520
+
521
+ async def run_performance_benchmark(self) -> Dict[str, Any]:
522
+ """Run comprehensive performance benchmark"""
523
+
524
+ logger.info("Running performance benchmark...")
525
+
526
+ benchmark_start = time.time()
527
+ benchmark_results = {
528
+ 'test_name': 'performance_benchmark',
529
+ 'start_time': benchmark_start,
530
+ 'scenarios': []
531
+ }
532
+
533
+ # Test scenarios with different file sizes and types
534
+ test_scenarios = [
535
+ {
536
+ 'name': 'Small HTML Document',
537
+ 'content': DocumentSampleGenerator.create_test_html()[:1000],
538
+ 'format': 'html'
539
+ },
540
+ {
541
+ 'name': 'Medium HTML Document',
542
+ 'content': DocumentSampleGenerator.create_test_html(),
543
+ 'format': 'html'
544
+ },
545
+ {
546
+ 'name': 'Large HTML Document',
547
+ 'content': DocumentSampleGenerator.create_test_html() * 3,
548
+ 'format': 'html'
549
+ },
550
+ {
551
+ 'name': 'Structured JSON Data',
552
+ 'content': DocumentSampleGenerator.create_test_json(),
553
+ 'format': 'json'
554
+ },
555
+ {
556
+ 'name': 'CSV Data Table',
557
+ 'content': DocumentSampleGenerator.create_test_csv(),
558
+ 'format': 'csv'
559
+ }
560
+ ]
561
+
562
+ for scenario in test_scenarios:
563
+ scenario_start = time.time()
564
+
565
+ try:
566
+ # Create temporary file
567
+ suffix = f".{scenario['format']}"
568
+ with tempfile.NamedTemporaryFile(mode='w', suffix=suffix, delete=False) as temp_file:
569
+ temp_file.write(scenario['content'])
570
+ temp_file_path = temp_file.name
571
+
572
+ # Simulate file processing
573
+ class MockFile:
574
+ def __init__(self, path, content):
575
+ self.name = path
576
+ self.content = content.encode('utf-8')
577
+ self.size = len(self.content)
578
+
579
+ def read(self):
580
+ return self.content
581
+
582
+ mock_file = MockFile(temp_file_path, scenario['content'])
583
+
584
+ # Measure processing steps
585
+ step_timings = {}
586
+
587
+ # File handling
588
+ step_start = time.time()
589
+ file_result = await self.file_handler.process_upload(mock_file)
590
+ step_timings['file_handling'] = time.time() - step_start
591
+
592
+ if file_result.success:
593
+ # Document conversion
594
+ step_start = time.time()
595
+ conversion_result = await self.conversion_engine.convert_stream(
596
+ mock_file.content, file_result.metadata
597
+ )
598
+ step_timings['conversion'] = time.time() - step_start
599
+
600
+ if conversion_result.success:
601
+ # Quality metrics calculation
602
+ step_start = time.time()
603
+ quality_metrics = self.quality_calculator.calculate_conversion_quality_metrics(
604
+ conversion_result
605
+ )
606
+ step_timings['quality_calculation'] = time.time() - step_start
607
+
608
+ scenario_duration = time.time() - scenario_start
609
+
610
+ scenario_result = {
611
+ 'name': scenario['name'],
612
+ 'status': 'success',
613
+ 'duration': scenario_duration,
614
+ 'step_timings': step_timings,
615
+ 'content_stats': {
616
+ 'input_size': len(scenario['content']),
617
+ 'output_size': len(conversion_result.content),
618
+ 'compression_ratio': len(conversion_result.content) / len(scenario['content'])
619
+ },
620
+ 'performance_metrics': {
621
+ 'throughput_chars_per_sec': len(scenario['content']) / scenario_duration,
622
+ 'processing_efficiency': quality_metrics.get('composite_score', 0) / scenario_duration
623
+ }
624
+ }
625
+ else:
626
+ scenario_result = {
627
+ 'name': scenario['name'],
628
+ 'status': 'conversion_failed',
629
+ 'error': conversion_result.error_message,
630
+ 'duration': time.time() - scenario_start
631
+ }
632
+ else:
633
+ scenario_result = {
634
+ 'name': scenario['name'],
635
+ 'status': 'file_handling_failed',
636
+ 'error': file_result.error_message,
637
+ 'duration': time.time() - scenario_start
638
+ }
639
+
640
+ benchmark_results['scenarios'].append(scenario_result)
641
+
642
+ # Clean up
643
+ Path(temp_file_path).unlink(missing_ok=True)
644
+
645
+ except Exception as e:
646
+ benchmark_results['scenarios'].append({
647
+ 'name': scenario['name'],
648
+ 'status': 'error',
649
+ 'error': str(e),
650
+ 'duration': time.time() - scenario_start
651
+ })
652
+
653
+ # Calculate overall benchmark metrics
654
+ successful_scenarios = [s for s in benchmark_results['scenarios'] if s['status'] == 'success']
655
+ total_duration = time.time() - benchmark_start
656
+
657
+ benchmark_results.update({
658
+ 'total_duration': total_duration,
659
+ 'scenarios_total': len(test_scenarios),
660
+ 'scenarios_successful': len(successful_scenarios),
661
+ 'success_rate': len(successful_scenarios) / len(test_scenarios) * 100,
662
+ 'average_processing_time': sum(s['duration'] for s in successful_scenarios) / len(successful_scenarios) if successful_scenarios else 0,
663
+ 'total_throughput': sum(s.get('performance_metrics', {}).get('throughput_chars_per_sec', 0) for s in successful_scenarios),
664
+ 'status': 'passed' if len(successful_scenarios) > len(test_scenarios) / 2 else 'failed'
665
+ })
666
+
667
+ return benchmark_results
668
+
669
+ async def run_visualization_test(self) -> Dict[str, Any]:
670
+ """Test visualization generation capabilities"""
671
+
672
+ logger.info("Running visualization test...")
673
+
674
+ test_start = time.time()
675
+
676
+ try:
677
+ # Create mock conversion result for testing
678
+ mock_conversion_result = ProcessingResult(
679
+ success=True,
680
+ content=DocumentSampleGenerator.create_test_html(),
681
+ metadata={
682
+ 'original_file': {
683
+ 'filename': 'test_document.html',
684
+ 'size': 5000,
685
+ 'extension': '.html'
686
+ }
687
+ },
688
+ processing_time=2.5
689
+ )
690
+
691
+ # Test visualization generation
692
+ visualization_tests = []
693
+
694
+ # Quality Dashboard Test
695
+ try:
696
+ dashboard_start = time.time()
697
+ quality_dashboard = self.viz_engine.create_quality_dashboard(mock_conversion_result)
698
+ dashboard_duration = time.time() - dashboard_start
699
+
700
+ visualization_tests.append({
701
+ 'name': 'quality_dashboard',
702
+ 'status': 'success',
703
+ 'duration': dashboard_duration,
704
+ 'chart_type': 'multi-chart dashboard',
705
+ 'data_points': len(quality_dashboard.data) if hasattr(quality_dashboard, 'data') else 'multiple'
706
+ })
707
+ except Exception as e:
708
+ visualization_tests.append({
709
+ 'name': 'quality_dashboard',
710
+ 'status': 'failed',
711
+ 'error': str(e)
712
+ })
713
+
714
+ # Structure Analysis Test
715
+ try:
716
+ structure_start = time.time()
717
+ structure_viz = self.viz_engine.create_structural_analysis_viz(mock_conversion_result)
718
+ structure_duration = time.time() - structure_start
719
+
720
+ visualization_tests.append({
721
+ 'name': 'structure_analysis',
722
+ 'status': 'success',
723
+ 'duration': structure_duration,
724
+ 'chart_type': 'structural analysis',
725
+ 'components': 'treemap, pie, bar, scatter'
726
+ })
727
+ except Exception as e:
728
+ visualization_tests.append({
729
+ 'name': 'structure_analysis',
730
+ 'status': 'failed',
731
+ 'error': str(e)
732
+ })
733
+
734
+ # Export Ready Report Test
735
+ try:
736
+ report_start = time.time()
737
+ export_report = self.viz_engine.create_export_ready_report(mock_conversion_result)
738
+ report_duration = time.time() - report_start
739
+
740
+ visualization_tests.append({
741
+ 'name': 'export_report',
742
+ 'status': 'success',
743
+ 'duration': report_duration,
744
+ 'chart_count': len(export_report),
745
+ 'report_types': list(export_report.keys())
746
+ })
747
+ except Exception as e:
748
+ visualization_tests.append({
749
+ 'name': 'export_report',
750
+ 'status': 'failed',
751
+ 'error': str(e)
752
+ })
753
+
754
+ test_duration = time.time() - test_start
755
+ successful_tests = [t for t in visualization_tests if t['status'] == 'success']
756
+
757
+ return {
758
+ 'test_name': 'visualization',
759
+ 'status': 'passed' if len(successful_tests) > 0 else 'failed',
760
+ 'duration': test_duration,
761
+ 'tests_run': len(visualization_tests),
762
+ 'tests_successful': len(successful_tests),
763
+ 'success_rate': len(successful_tests) / len(visualization_tests) * 100,
764
+ 'test_details': visualization_tests
765
+ }
766
+
767
+ except Exception as e:
768
+ return {
769
+ 'test_name': 'visualization',
770
+ 'status': 'error',
771
+ 'error': str(e),
772
+ 'duration': time.time() - test_start
773
+ }
774
+
775
+ async def run_comprehensive_test_suite(self, gemini_api_key: Optional[str] = None) -> Dict[str, Any]:
776
+ """Run complete test suite with all components"""
777
+
778
+ logger.info("Starting comprehensive test suite...")
779
+
780
+ suite_start = time.time()
781
+
782
+ # Run all tests
783
+ test_results = []
784
+
785
+ # Basic conversion test
786
+ basic_test = await self.run_basic_conversion_test()
787
+ test_results.append(basic_test)
788
+
789
+ # AI analysis test (if API key provided)
790
+ if gemini_api_key:
791
+ ai_test = await self.run_ai_analysis_test(gemini_api_key)
792
+ test_results.append(ai_test)
793
+
794
+ # Performance benchmark
795
+ perf_test = await self.run_performance_benchmark()
796
+ test_results.append(perf_test)
797
+
798
+ # Visualization test
799
+ viz_test = await self.run_visualization_test()
800
+ test_results.append(viz_test)
801
+
802
+ # Calculate overall results
803
+ suite_duration = time.time() - suite_start
804
+ passed_tests = [t for t in test_results if t.get('status') == 'passed']
805
+ failed_tests = [t for t in test_results if t.get('status') in ['failed', 'error']]
806
+
807
+ # Generate comprehensive report
808
+ comprehensive_report = {
809
+ 'test_suite': 'MarkItDown Platform Comprehensive Test',
810
+ 'timestamp': datetime.now().isoformat(),
811
+ 'duration': suite_duration,
812
+ 'summary': {
813
+ 'total_tests': len(test_results),
814
+ 'passed': len(passed_tests),
815
+ 'failed': len(failed_tests),
816
+ 'skipped': len([t for t in test_results if t.get('status') == 'skipped']),
817
+ 'success_rate': len(passed_tests) / len(test_results) * 100
818
+ },
819
+ 'test_results': test_results,
820
+ 'system_info': self._get_system_info(),
821
+ 'recommendations': self._generate_recommendations(test_results),
822
+ 'overall_status': 'PASSED' if len(passed_tests) > len(test_results) / 2 else 'FAILED'
823
+ }
824
+
825
+ return comprehensive_report
826
+
827
+ def _validate_conversion_output(self, conversion_result: ProcessingResult) -> Dict[str, Any]:
828
+ """Validate conversion output quality and completeness"""
829
+
830
+ content = conversion_result.content
831
+ validation_results = {
832
+ 'content_length_ok': len(content) > 100,
833
+ 'has_headers': content.count('#') > 0,
834
+ 'has_lists': content.count('- ') > 0 or content.count('* ') > 0,
835
+ 'has_tables': content.count('|') > 0,
836
+ 'has_links': content.count('](') > 0,
837
+ 'proper_encoding': all(ord(char) < 128 for char in content[:1000]), # ASCII-only sample check; non-ASCII content such as emoji will fail it
838
+ 'no_empty_sections': '##\n\n##' not in content
839
+ }
840
+
841
+ # Calculate validation score
842
+ validation_score = sum(validation_results.values()) / len(validation_results)
843
+ validation_results['overall_score'] = validation_score
844
+ validation_results['status'] = 'passed' if validation_score > 0.7 else 'warning' if validation_score > 0.5 else 'failed'
845
+
846
+ return validation_results
847
+
848
+ def _get_system_info(self) -> Dict[str, Any]:
849
+ """Get system information for test report"""
850
+
851
+ try:
852
+ import psutil
853
+ import platform
854
+
855
+ memory = psutil.virtual_memory()
856
+
857
+ return {
858
+ 'platform': platform.platform(),
859
+ 'python_version': platform.python_version(),
860
+ 'cpu_count': psutil.cpu_count(),
861
+ 'memory_total_gb': memory.total / (1024**3),
862
+ 'memory_available_gb': memory.available / (1024**3),
863
+ 'architecture': platform.architecture()[0]
864
+ }
865
+ except Exception as e:
866
+ return {'error': f'Could not gather system info: {e}'}
867
+
868
+ def _generate_recommendations(self, test_results: List[Dict[str, Any]]) -> List[str]:
869
+ """Generate recommendations based on test results"""
870
+
871
+ recommendations = []
872
+
873
+ # Analyze test results for recommendations
874
+ for test in test_results:
875
+ if test.get('status') == 'failed':
876
+ test_name = test.get('test_name', 'unknown')
877
+ recommendations.append(f"❌ {test_name.title()} test failed - investigate {test.get('error', 'unknown error')}")
878
+
879
+ elif test.get('status') == 'passed':
880
+ test_name = test.get('test_name', 'unknown')
881
+
882
+ # Performance recommendations
883
+ if 'duration' in test and test['duration'] > 30:
884
+ recommendations.append(f"⚠️ {test_name.title()} test took {test['duration']:.2f}s - consider optimization")
885
+
886
+ # Success rate recommendations
887
+ if 'success_rate' in test and test['success_rate'] < 90:
888
+ recommendations.append(f"⚠️ {test_name.title()} success rate is {test['success_rate']:.1f}% - investigate reliability issues")
889
+
890
+ # General recommendations
891
+ if not any('ai_analysis' in str(test) for test in test_results):
892
+ recommendations.append("💡 Consider adding Gemini API key for AI analysis testing")
893
+
894
+ if not recommendations:
895
+ recommendations.append("✅ All tests passed successfully - platform ready for production use")
896
+
897
+ return recommendations
898
+
899
+
900
+ class UsageExamples:
901
+ """Practical usage examples for different scenarios"""
902
+
903
+ @staticmethod
904
+ async def example_basic_usage():
905
+ """Example: Basic document conversion"""
906
+
907
+ print("=== Basic Document Conversion Example ===")
908
+
909
+ # Initialize components
910
+ config = ProcessingConfig()
911
+ resource_manager = ResourceManager(config)
912
+ file_handler = StreamlineFileHandler(resource_manager)
913
+ conversion_engine = HFConversionEngine(resource_manager, config)
914
+
915
+ # Create sample document
916
+ sample_html = DocumentSampleGenerator.create_test_html()
917
+
918
+ # Simulate file upload
919
+ class MockFile:
920
+ def __init__(self, content):
921
+ self.name = "sample.html"
922
+ self.content = content.encode('utf-8')
923
+ self.size = len(self.content)
924
+
925
+ def read(self):
926
+ return self.content
927
+
928
+ mock_file = MockFile(sample_html)
929
+
930
+ try:
931
+ # Process file
932
+ print("1. Processing uploaded file...")
933
+ file_result = await file_handler.process_upload(mock_file)
934
+
935
+ if file_result.success:
936
+ print(f" ✅ File processed: {file_result.metadata['filename']}")
937
+
938
+ # Convert document
939
+ print("2. Converting to Markdown...")
940
+ conversion_result = await conversion_engine.convert_stream(
941
+ mock_file.content, file_result.metadata
942
+ )
943
+
944
+ if conversion_result.success:
945
+ print(f" ✅ Conversion successful in {conversion_result.processing_time:.2f}s")
946
+ print(f" 📄 Generated {len(conversion_result.content)} characters")
947
+ print(f" 📋 Preview: {conversion_result.content[:200]}...")
948
+
949
+ # Calculate quality metrics
950
+ print("3. Calculating quality metrics...")
951
+ quality_calculator = QualityMetricsCalculator()
952
+ metrics = quality_calculator.calculate_conversion_quality_metrics(conversion_result)
953
+
954
+ print(f" 📊 Composite Score: {metrics.get('composite_score', 0):.1f}/10")
955
+ print(f" 📈 Word Count: {metrics.get('basic_metrics', {}).get('total_words', 0)}")
956
+ print(f" 🏗️ Structure Elements: {metrics.get('structural_metrics', {}).get('header_count', 0)} headers")
957
+
958
+ else:
959
+ print(f" ❌ Conversion failed: {conversion_result.error_message}")
960
+ else:
961
+ print(f" ❌ File processing failed: {file_result.error_message}")
962
+
963
+ except Exception as e:
964
+ print(f" ❌ Example failed: {e}")
965
+
966
+ print("\n" + "="*50 + "\n")
967
+
968
+ @staticmethod
969
+ async def example_ai_integration(api_key: str):
970
+ """Example: AI-powered analysis integration"""
971
+
972
+ if not api_key:
973
+ print("=== AI Integration Example (Skipped - No API Key) ===\n")
974
+ return
975
+
976
+ print("=== AI-Powered Analysis Example ===")
977
+
978
+ try:
979
+ # Initialize Gemini engine
980
+ print("1. Initializing Gemini AI...")
981
+ gemini_config = GeminiConfig(api_key=api_key)
982
+ gemini_engine = GeminiAnalysisEngine(gemini_config)
983
+
984
+ # Sample content for analysis
985
+ sample_content = """
986
+ # Enterprise Document Management Strategy
987
+
988
+ ## Executive Summary
989
+ This document outlines our comprehensive approach to modernizing document
990
+ management processes through automated conversion and AI-powered analysis.
991
+
992
+ ## Key Objectives
993
+ 1. **Standardization**: Convert legacy formats to modern, searchable formats
994
+ 2. **Quality Assurance**: Implement AI-driven quality validation
995
+ 3. **Efficiency**: Reduce manual processing time by 75%
996
+ 4. **Scalability**: Handle 10,000+ documents monthly
997
+
998
+ ## Implementation Timeline
999
+
1000
+ | Phase | Duration | Deliverables |
1001
+ |-------|----------|--------------|
1002
+ | Phase 1 | 2 months | Platform deployment |
1003
+ | Phase 2 | 3 months | AI integration |
1004
+ | Phase 3 | 1 month | Quality validation |
1005
+
1006
+ ## Expected ROI
1007
+ - Processing time reduction: 75%
1008
+ - Quality improvement: 40%
1009
+ - Cost savings: $50,000 annually
1010
+ """
1011
+
1012
+ # Test different analysis types
1013
+ analysis_types = [
1014
+ (AnalysisType.QUALITY_ANALYSIS, "Quality Assessment"),
1015
+ (AnalysisType.CONTENT_SUMMARY, "Content Summary"),
1016
+ (AnalysisType.STRUCTURE_REVIEW, "Structure Analysis")
1017
+ ]
1018
+
1019
+ for analysis_type, description in analysis_types:
1020
+ print(f"\n2. Running {description}...")
1021
+
1022
+ request = AnalysisRequest(
1023
+ content=sample_content,
1024
+ analysis_type=analysis_type,
1025
+ model=GeminiModel.PRO
1026
+ )
1027
+
1028
+ result = await gemini_engine.analyze_content(request)
1029
+
1030
+ if result.success:
1031
+ print(f" ✅ {description} completed in {result.processing_time:.2f}s")
1032
+
1033
+ if analysis_type == AnalysisType.QUALITY_ANALYSIS:
1034
+ content = result.content
1035
+ print(f" 📊 Overall Score: {content.get('overall_score', 0)}/10")
1036
+ print(f" 🏗️ Structure Score: {content.get('structure_score', 0)}/10")
1037
+ print(f" 📋 Completeness: {content.get('completeness_score', 0)}/10")
1038
+
1039
+ elif analysis_type == AnalysisType.CONTENT_SUMMARY:
1040
+ summary = result.content.get('executive_summary', '')[:200]
1041
+ print(f" 📝 Summary: {summary}...")
1042
+
1043
+ else:
1044
+ print(f" ❌ {description} failed: {result.error_message}")
1045
+
1046
+ # Performance metrics
1047
+ print(f"\n3. Performance Metrics:")
1048
+ perf_metrics = gemini_engine.get_performance_metrics()
1049
+ print(f" 📈 Total Requests: {perf_metrics['total_requests']}")
1050
+ print(f" ⏱️ Average Time: {perf_metrics['average_processing_time']:.2f}s")
1051
+ print(f" ✅ Success Rate: {perf_metrics['success_rate_percent']:.1f}%")
1052
+
1053
+ except Exception as e:
1054
+ print(f" ❌ AI Integration example failed: {e}")
1055
+
1056
+ print("\n" + "="*50 + "\n")
1057
+
1058
+ @staticmethod
1059
+ async def example_visualization_generation():
1060
+ """Example: Generate interactive visualizations"""
1061
+
1062
+ print("=== Visualization Generation Example ===")
1063
+
1064
+ try:
1065
+ # Create mock results for visualization
1066
+ mock_result = ProcessingResult(
1067
+ success=True,
1068
+ content=DocumentSampleGenerator.create_test_html(),
1069
+ metadata={
1070
+ 'original_file': {'filename': 'test.html', 'size': 5000}
1071
+ },
1072
+ processing_time=2.3
1073
+ )
1074
+
1075
+ # Initialize visualization engine
1076
+ print("1. Initializing visualization engine...")
1077
+ viz_engine = InteractiveVisualizationEngine()
1078
+
1079
+ # Generate quality dashboard
1080
+ print("2. Creating quality dashboard...")
1081
+ dashboard_start = time.time()
1082
+ quality_dashboard = viz_engine.create_quality_dashboard(mock_result)
1083
+ dashboard_time = time.time() - dashboard_start
1084
+
1085
+ print(f" ✅ Quality dashboard generated in {dashboard_time:.2f}s")
1086
+ print(f" 📊 Chart components: {len(quality_dashboard.data)} data traces")
1087
+
1088
+ # Generate structure analysis
1089
+ print("3. Creating structure analysis...")
1090
+ structure_start = time.time()
1091
+ structure_viz = viz_engine.create_structural_analysis_viz(mock_result)
1092
+ structure_time = time.time() - structure_start
1093
+
1094
+ print(f" ✅ Structure analysis generated in {structure_time:.2f}s")
1095
+
1096
+ # Generate export report
1097
+ print("4. Creating export-ready report...")
1098
+ report_start = time.time()
1099
+ export_report = viz_engine.create_export_ready_report(mock_result)
1100
+ report_time = time.time() - report_start
1101
+
1102
+ print(f" ✅ Export report generated in {report_time:.2f}s")
1103
+ print(f" 📈 Report components: {list(export_report.keys())}")
1104
+
1105
+ total_time = dashboard_time + structure_time + report_time
1106
+ print(f"\n 📊 Total visualization time: {total_time:.2f}s")
1107
+
1108
+ except Exception as e:
1109
+ print(f" ❌ Visualization example failed: {e}")
1110
+
1111
+ print("\n" + "="*50 + "\n")
1112
+
1113
+
1114
+ async def main():
1115
+ """Main function to run examples and tests"""
1116
+
1117
+ print("🚀 MarkItDown Testing Platform - Examples & Testing Suite")
1118
+ print("=" * 60)
1119
+
1120
+ # Run usage examples
1121
+ await UsageExamples.example_basic_usage()
1122
+
1123
+ # Ask for Gemini API key for AI examples
1124
+ api_key = input("Enter Gemini API key for AI examples (press Enter to skip): ").strip()
1125
+ if api_key:
1126
+ await UsageExamples.example_ai_integration(api_key)
1127
+
1128
+ await UsageExamples.example_visualization_generation()
1129
+
1130
+ # Run comprehensive test suite
1131
+ print("🧪 Running Comprehensive Test Suite...")
1132
+ print("=" * 40)
1133
+
1134
+ tester = PlatformTester()
1135
+ test_results = await tester.run_comprehensive_test_suite(api_key or None)
1136
+
1137
+ # Display test results
1138
+ print(f"\n📊 Test Suite Results:")
1139
+ print(f" Status: {test_results['overall_status']}")
1140
+ print(f" Duration: {test_results['duration']:.2f}s")
1141
+ print(f" Success Rate: {test_results['summary']['success_rate']:.1f}%")
1142
+ print(f" Tests: {test_results['summary']['passed']}/{test_results['summary']['total_tests']} passed")
1143
+
1144
+ if test_results['recommendations']:
1145
+ print(f"\n💡 Recommendations:")
1146
+ for rec in test_results['recommendations'][:5]: # Show top 5
1147
+ print(f" {rec}")
1148
+
1149
+ # Save detailed results
1150
+ results_file = f"test_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
1151
+ with open(results_file, 'w') as f:
1152
+ json.dump(test_results, f, indent=2, default=str)
1153
+
1154
+ print(f"\n📁 Detailed results saved to: {results_file}")
1155
+ print("\n✅ Examples and testing complete!")
1156
+
1157
+
1158
+ if __name__ == "__main__":
1159
+ asyncio.run(main())
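`main()` above collects the Gemini API key via `input()`, which hangs in CI and other non-interactive runs. A minimal headless variant (an editorial sketch, assuming the key arrives through a hypothetical `GEMINI_API_KEY` environment variable and that the function is appended to this module):

```python
import asyncio
import os


async def ci_main() -> int:
    """Run the test suite headlessly; AI tests are skipped when no key is set."""
    api_key = os.environ.get("GEMINI_API_KEY")  # None -> run_ai_analysis_test is skipped
    tester = PlatformTester()
    report = await tester.run_comprehensive_test_suite(api_key)
    print(f"{report['overall_status']}: "
          f"{report['summary']['passed']}/{report['summary']['total_tests']} tests passed")
    return 0 if report['overall_status'] == 'PASSED' else 1


# Example: raise SystemExit(asyncio.run(ci_main()))
```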
llm/gemini_connector.py ADDED
@@ -0,0 +1,721 @@
1
+ """
2
+ Enterprise-Grade Gemini Integration Layer
3
+
4
+ Strategic Design Philosophy:
5
+ - Multi-model orchestration for diverse analysis needs
6
+ - Robust error handling with graceful degradation
7
+ - Configurable analysis pipelines for different use cases
8
+ - Performance optimization for HF Spaces constraints
9
+
10
+ This module provides a comprehensive Gemini API integration designed for
11
+ enterprise-scale document analysis with focus on reliability and extensibility.
12
+ """
13
+
14
+ import asyncio
15
+ import json
16
+ import logging
17
+ from datetime import datetime
18
+ from typing import Dict, Any, List, Optional, Union, AsyncGenerator
19
+ from dataclasses import dataclass, asdict
20
+ from enum import Enum
21
+
22
+ import google.generativeai as genai
23
+ from google.generativeai.types import HarmCategory, HarmBlockThreshold
24
+ from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
25
+ from pydantic import BaseModel, Field, field_validator, JsonValue
26
+
27
+
28
+ JSONDict = Dict[str, JsonValue]
29
+
30
+
31
+ # Strategic Configuration Classes
32
+ class AnalysisType(Enum):
33
+ """Enumeration of available analysis types"""
34
+ QUALITY_ANALYSIS = "quality_analysis"
35
+ STRUCTURE_REVIEW = "structure_review"
36
+ CONTENT_SUMMARY = "content_summary"
37
+ COMPARATIVE_ANALYSIS = "comparative_analysis"
38
+ EXTRACTION_QUALITY = "extraction_quality"
39
+
40
+
41
+ class GeminiModel(Enum):
42
+ """Available Gemini models with strategic use case mapping"""
43
+ PRO = "gemini-1.5-pro" # Complex analysis, reasoning
44
+ FLASH = "gemini-1.5-flash" # Fast processing, summaries
45
+ PRO_VISION = "gemini-1.5-pro-vision" # Multimodal content analysis
46
+
47
+
48
+ @dataclass
49
+ class GeminiConfig:
50
+ """Comprehensive Gemini API configuration"""
51
+ api_key: Optional[str] = None
52
+ default_model: GeminiModel = GeminiModel.PRO
53
+ max_tokens: int = 8192
54
+ temperature: float = 0.1 # Low temperature for consistent analysis
55
+ timeout_seconds: int = 60
56
+ max_retry_attempts: int = 3
57
+ safety_settings: Optional[Dict] = None
58
+
59
+ def __post_init__(self):
60
+ if self.safety_settings is None:
61
+ self.safety_settings = {
62
+ HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
63
+ HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
64
+ HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
65
+ HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
66
+ }
67
+
68
+
69
+ class AnalysisRequest(BaseModel):
70
+ """Structured request for document analysis"""
71
+
72
+ content: str = Field(..., description="Markdown content to analyze")
73
+ analysis_type: AnalysisType = Field(..., description="Type of analysis to perform")
74
+ model: GeminiModel = Field(default=GeminiModel.PRO, description="Gemini model to use")
75
+ custom_instructions: Optional[str] = Field(None, description="Additional analysis instructions")
76
+ context: Optional[JSONDict] = Field(default_factory=dict, description="Additional context")
77
+
78
+ @field_validator('content')
+ @classmethod
79
+ def validate_content(cls, v):
80
+ if not v or len(v.strip()) < 10:
81
+ raise ValueError("Content must be at least 10 characters long")
82
+ return v
83
+
84
+
85
+ class AnalysisResponse(BaseModel):
86
+ """Standardized analysis response structure"""
87
+
88
+ success: bool
89
+ analysis_type: AnalysisType
90
+ model_used: GeminiModel
91
+ content: JSONDict
92
+ metadata: JSONDict
93
+ error_message: Optional[str] = None
94
+ processing_time: Optional[float] = None
95
+ token_usage: Optional[Dict[str, int]] = None
96
+
97
+
98
+ class GeminiAnalysisEngine:
99
+ """
100
+ Comprehensive Gemini-powered analysis system
101
+
102
+ Strategic Architecture:
103
+ - Multi-model orchestration for optimal performance vs cost
104
+ - Prompt engineering templates for consistent results
105
+ - Error handling with intelligent retry mechanisms
106
+ - Performance monitoring and optimization
107
+ """
108
+
109
+ # Strategic Prompt Templates for Different Analysis Types
110
+ ANALYSIS_PROMPTS = {
111
+ AnalysisType.QUALITY_ANALYSIS: {
112
+ "system": """You are an expert document conversion analyst specializing in evaluating
113
+ the quality of document-to-Markdown conversions.""",
114
+ "template": """
115
+ Analyze the quality of this Markdown conversion from a document.
116
+
117
+ **Analysis Focus Areas:**
118
+ 1. **Structure Preservation**: How well are headers, lists, tables maintained?
119
+ 2. **Content Completeness**: Is all information preserved from the original?
120
+ 3. **Formatting Accuracy**: Are formatting elements correctly converted?
121
+ 4. **Information Hierarchy**: Is the document structure logical and clear?
122
+ 5. **Readability**: How accessible is the converted content?
123
+
124
+ **Content to Analyze:**
125
+ ```markdown
126
+ {content}
127
+ ```
128
+
129
+ **Provide your analysis as a structured JSON response with these fields:**
130
+ - overall_score: (1-10 scale)
131
+ - structure_score: (1-10 scale)
132
+ - completeness_score: (1-10 scale)
133
+ - accuracy_score: (1-10 scale)
134
+ - readability_score: (1-10 scale)
135
+ - detailed_feedback: (string with specific observations)
136
+ - recommendations: (array of improvement suggestions)
137
+ - detected_elements: (object listing found structural elements)
138
+
139
+ Focus on actionable insights and specific examples from the content.
140
+ """,
141
+ },
142
+
143
+ AnalysisType.STRUCTURE_REVIEW: {
144
+ "system": """You are a document structure specialist analyzing Markdown
145
+ document organization and hierarchy.""",
146
+ "template": """
147
+ Conduct a comprehensive structural analysis of this Markdown document.
148
+
149
+ **Structure Analysis Requirements:**
150
+ 1. **Hierarchy Analysis**: Map all heading levels (H1, H2, H3, etc.)
151
+ 2. **List Structures**: Identify and categorize all lists (ordered, unordered, nested)
152
+ 3. **Table Analysis**: Evaluate table formatting and completeness
153
+ 4. **Content Organization**: Assess logical flow and organization
154
+ 5. **Special Elements**: Identify code blocks, links, images, etc.
155
+
156
+ **Content to Analyze:**
157
+ ```markdown
158
+ {content}
159
+ ```
160
+
161
+ **Provide a structured JSON response with:**
162
+ - document_outline: (hierarchical structure map)
163
+ - heading_analysis: (object with heading counts and levels)
164
+ - list_analysis: (detailed list structure information)
165
+ - table_analysis: (table count, structure, formatting quality)
166
+ - special_elements: (code blocks, links, images, etc.)
167
+ - organization_score: (1-10 scale)
168
+ - structure_recommendations: (array of specific improvements)
169
+ - accessibility_notes: (readability and navigation considerations)
170
+
171
+ Provide specific examples and actionable structural insights.
172
+ """,
173
+ },
174
+
175
+ AnalysisType.CONTENT_SUMMARY: {
176
+ "system": """You are a content analysis expert specializing in document
177
+ summarization and thematic analysis.""",
178
+ "template": """
179
+ Create a comprehensive content summary and thematic analysis of this document.
180
+
181
+ **Summary Requirements:**
182
+ 1. **Executive Summary**: 2-3 sentence overview of main content
183
+ 2. **Key Topics**: Primary themes and subjects covered
184
+ 3. **Content Classification**: Document type, purpose, target audience
185
+ 4. **Information Density**: Assessment of content richness and depth
186
+ 5. **Actionable Insights**: Key takeaways and important information
187
+
188
+ **Content to Analyze:**
189
+ ```markdown
190
+ {content}
191
+ ```
192
+
193
+ **Provide a structured JSON response with:**
194
+ - executive_summary: (brief overview)
195
+ - main_topics: (array of key themes)
196
+ - document_classification: (type, purpose, audience)
197
+ - content_metrics: (word count estimates, complexity level)
198
+ - key_information: (array of important facts/insights)
199
+ - content_quality: (1-10 scale for informativeness)
200
+ - summary_recommendations: (suggestions for content improvement)
201
+ - thematic_analysis: (deeper dive into content themes)
202
+
203
+ Focus on extracting actionable intelligence from the content.
204
+ """,
205
+ },
206
+
207
+ AnalysisType.EXTRACTION_QUALITY: {
208
+ "system": """You are a data extraction quality specialist evaluating how well
209
+ information was preserved during document conversion.""",
210
+ "template": """
211
+ Evaluate the extraction quality and information preservation in this converted document.
212
+
213
+ **Quality Assessment Areas:**
214
+ 1. **Data Preservation**: Are numbers, dates, names preserved accurately?
215
+ 2. **Formatting Retention**: How well were original formatting cues maintained?
216
+ 3. **Context Preservation**: Is the meaning and context clear?
217
+ 4. **Information Completeness**: Are there signs of missing information?
218
+ 5. **Conversion Artifacts**: Any obvious conversion errors or artifacts?
219
+
220
+ **Content to Analyze:**
221
+ ```markdown
222
+ {content}
223
+ ```
224
+
225
+ **Provide a structured JSON response with:**
226
+ - extraction_score: (1-10 overall quality)
227
+ - data_accuracy: (assessment of numerical/factual data)
228
+ - context_preservation: (meaning and relationships maintained)
229
+ - formatting_quality: (original structure maintained)
230
+ - completeness_indicators: (signs of missing content)
231
+ - conversion_artifacts: (errors or issues detected)
232
+ - quality_recommendations: (specific improvement suggestions)
233
+ - confidence_level: (confidence in the analysis)
234
+
235
+ Identify specific examples of good and poor extraction quality.
236
+ """,
237
+ }
238
+ }
239
+
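For orientation, a minimal sketch of a reply that satisfies the QUALITY_ANALYSIS contract above; the field names come from the prompt template, the values are invented:

```python
# Illustrative only: field names mirror the QUALITY_ANALYSIS template above,
# values are made up for the example.
sample_quality_reply = {
    "overall_score": 8,
    "structure_score": 9,
    "completeness_score": 8,
    "accuracy_score": 7,
    "readability_score": 8,
    "detailed_feedback": "Headings and tables survived; two images lost their alt text.",
    "recommendations": ["Restore image alt text", "Normalize heading levels"],
    "detected_elements": {"headings": 12, "tables": 3, "code_blocks": 1},
}
```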
240
+ def __init__(self, config: GeminiConfig):
241
+ """Initialize Gemini Analysis Engine with configuration"""
242
+
243
+ self.config = config
244
+ self.client = None
245
+ self._initialize_client()
246
+
247
+ # Performance tracking
248
+ self.request_count = 0
249
+ self.total_processing_time = 0.0
250
+ self.error_count = 0
251
+
252
+ def _initialize_client(self):
253
+ """Initialize Gemini client with error handling"""
254
+
255
+ if not self.config.api_key:
256
+ raise ValueError("Gemini API key is required")
257
+
258
+ try:
259
+ genai.configure(api_key=self.config.api_key)
260
+ # Test client initialization with a simple call
261
+ models = genai.list_models()
262
+ logging.info(f"Gemini client initialized successfully. Available models: {len(list(models))}")
263
+
264
+ except Exception as e:
265
+ logging.error(f"Failed to initialize Gemini client: {e}")
266
+ raise
267
+
268
+ # NOTE: analyze_content converts all exceptions into error AnalysisResponse
+ # objects below, so this retry only fires if an exception escapes that
+ # handler; decorating _execute_analysis would make the retries effective.
+ @retry(
269
+ stop=stop_after_attempt(3),
270
+ wait=wait_exponential(multiplier=1, min=4, max=10),
271
+ retry=retry_if_exception_type((Exception,))
272
+ )
273
+ async def analyze_content(self, request: AnalysisRequest) -> AnalysisResponse:
274
+ """
275
+ Execute comprehensive content analysis with retry logic
276
+
277
+ Strategic Processing Approach:
278
+ 1. Validate request and prepare prompt
279
+ 2. Execute analysis with appropriate model
280
+ 3. Parse and validate response
281
+ 4. Return structured results with metadata
282
+ """
283
+
284
+ start_time = datetime.now()
285
+ self.request_count += 1
286
+
287
+ try:
288
+ # Prepare analysis prompt
289
+ prompt = self._build_analysis_prompt(request)
290
+
291
+ # Select optimal model for analysis type
292
+ model_name = self._select_optimal_model(request.analysis_type, request.model)
293
+
294
+ # Execute analysis
295
+ response = await self._execute_analysis(model_name, prompt)
296
+
297
+ # Parse and structure response
298
+ analysis_content = self._parse_analysis_response(response.text, request.analysis_type)
299
+
300
+ processing_time = (datetime.now() - start_time).total_seconds()
301
+ self.total_processing_time += processing_time
302
+
303
+ return AnalysisResponse(
304
+ success=True,
305
+ analysis_type=request.analysis_type,
306
+ model_used=GeminiModel(model_name),
307
+ content=analysis_content,
308
+ metadata={
309
+ 'processing_time': processing_time,
310
+ 'content_length': len(request.content),
311
+ 'prompt_tokens': len(prompt.split()), # Rough estimate
312
+ 'timestamp': start_time.isoformat(),
313
+ 'request_id': self.request_count
314
+ },
315
+ processing_time=processing_time
316
+ )
317
+
318
+ except Exception as e:
319
+ self.error_count += 1
320
+ processing_time = (datetime.now() - start_time).total_seconds()
321
+
322
+ logging.error(f"Analysis failed for {request.analysis_type}: {e}")
323
+
324
+ return AnalysisResponse(
325
+ success=False,
326
+ analysis_type=request.analysis_type,
327
+ model_used=request.model,
328
+ content={},
329
+ metadata={'error_timestamp': datetime.now().isoformat()},
330
+ error_message=str(e),
331
+ processing_time=processing_time
332
+ )
333
+
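A minimal usage sketch for the method above; every class and field name appears in this module, while the GEMINI_API_KEY environment variable name is our own convention:

```python
import asyncio
import os

async def _demo() -> None:
    # GeminiConfig/AnalysisRequest fields as defined in this module
    engine = GeminiAnalysisEngine(GeminiConfig(api_key=os.environ["GEMINI_API_KEY"]))
    request = AnalysisRequest(
        content="# Sample\n\nA short test document.",
        analysis_type=AnalysisType.QUALITY_ANALYSIS,
        model=GeminiModel.PRO,
    )
    response = await engine.analyze_content(request)
    print(response.success, response.metadata.get("processing_time"))

asyncio.run(_demo())
```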
334
+ def _build_analysis_prompt(self, request: AnalysisRequest) -> str:
335
+ """Build comprehensive analysis prompt from template"""
336
+
337
+ prompt_config = self.ANALYSIS_PROMPTS.get(request.analysis_type)
338
+ if not prompt_config:
339
+ raise ValueError(f"Unsupported analysis type: {request.analysis_type}")
340
+
341
+ # Build complete prompt with system context
342
+ system_context = prompt_config["system"]
343
+ main_prompt = prompt_config["template"].format(content=request.content)
344
+
345
+ # Add custom instructions if provided
346
+ if request.custom_instructions:
347
+ main_prompt += f"\n\n**Additional Instructions:**\n{request.custom_instructions}"
348
+
349
+ # Add context if available
350
+ if request.context:
351
+ context_str = "\n".join([f"- {k}: {v}" for k, v in request.context.items()])
352
+ main_prompt += f"\n\n**Context:**\n{context_str}"
353
+
354
+ return f"{system_context}\n\n{main_prompt}"
355
+
356
+ def _select_optimal_model(self, analysis_type: AnalysisType, requested_model: GeminiModel) -> str:
357
+ """Select optimal Gemini model based on analysis requirements"""
358
+
359
+ # Strategic model selection based on analysis complexity
360
+ model_recommendations = {
361
+ AnalysisType.QUALITY_ANALYSIS: GeminiModel.PRO, # Complex reasoning
362
+ AnalysisType.STRUCTURE_REVIEW: GeminiModel.PRO, # Detailed analysis
363
+ AnalysisType.CONTENT_SUMMARY: GeminiModel.FLASH, # Fast processing
364
+ AnalysisType.COMPARATIVE_ANALYSIS: GeminiModel.PRO, # Complex comparison
365
+ AnalysisType.EXTRACTION_QUALITY: GeminiModel.PRO, # Detailed quality assessment
366
+ }
367
+
368
+ # Prefer the recommended model; fall back to the caller's requested model only for analysis types without a recommendation
369
+ recommended_model = model_recommendations.get(analysis_type, requested_model)
370
+ return recommended_model.value
371
+
372
+ async def _execute_analysis(self, model_name: str, prompt: str):
373
+ """Execute analysis using Gemini API with timeout and error handling"""
374
+
375
+ try:
376
+ model = genai.GenerativeModel(
377
+ model_name=model_name,
378
+ safety_settings=self.config.safety_settings
379
+ )
380
+
381
+ # Configure generation parameters
382
+ generation_config = genai.GenerationConfig(
383
+ max_output_tokens=self.config.max_tokens,
384
+ temperature=self.config.temperature,
385
+ )
386
+
387
+ # Execute with timeout
388
+ response = await asyncio.wait_for(
389
+ asyncio.to_thread(
390
+ model.generate_content,
391
+ prompt,
392
+ generation_config=generation_config
393
+ ),
394
+ timeout=self.config.timeout_seconds
395
+ )
396
+
397
+ return response
398
+
399
+ except asyncio.TimeoutError:
400
+ raise TimeoutError(f"Gemini API request timed out after {self.config.timeout_seconds} seconds")
401
+ except Exception as e:
402
+ raise RuntimeError(f"Gemini API error: {str(e)}")
403
+
404
+ def _parse_analysis_response(self, response_text: str, analysis_type: AnalysisType) -> JSONDict:
405
+ """Parse and validate Gemini response into structured format"""
406
+
407
+ try:
408
+ # Try to extract JSON from response
409
+ json_start = response_text.find('{')
410
+ json_end = response_text.rfind('}') + 1
411
+
412
+ if json_start >= 0 and json_end > json_start:
413
+ json_content = response_text[json_start:json_end]
414
+ parsed_response = json.loads(json_content)
415
+
416
+ # Validate required fields based on analysis type
417
+ validated_response = self._validate_response_structure(parsed_response, analysis_type)
418
+ return validated_response
419
+
420
+ else:
421
+ # Fallback: structure unstructured response
422
+ return self._structure_unstructured_response(response_text, analysis_type)
423
+
424
+ except json.JSONDecodeError:
425
+ # Handle non-JSON response
426
+ return self._structure_unstructured_response(response_text, analysis_type)
427
+
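To make the fallback behaviour above concrete, a small sketch of the brace scan on a typical conversational reply:

```python
import json

# The scan tolerates prose around the JSON payload...
reply = 'Here is my assessment:\n{"overall_score": 8, "detailed_feedback": "Solid"}\nThanks!'
start, end = reply.find('{'), reply.rfind('}') + 1
print(json.loads(reply[start:end]))  # {'overall_score': 8, 'detailed_feedback': 'Solid'}

# ...while a reply with no braces at all falls through to
# _structure_unstructured_response instead.
```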
428
+ def _validate_response_structure(self, response: JSONDict, analysis_type: AnalysisType) -> JSONDict:
429
+ """Validate and ensure response contains required fields"""
430
+
431
+ # Define required fields for each analysis type
432
+ required_fields = {
433
+ AnalysisType.QUALITY_ANALYSIS: [
434
+ 'overall_score', 'structure_score', 'completeness_score',
435
+ 'accuracy_score', 'readability_score', 'detailed_feedback'
436
+ ],
437
+ AnalysisType.STRUCTURE_REVIEW: [
438
+ 'document_outline', 'heading_analysis', 'organization_score'
439
+ ],
440
+ AnalysisType.CONTENT_SUMMARY: [
441
+ 'executive_summary', 'main_topics', 'content_quality'
442
+ ],
443
+ AnalysisType.EXTRACTION_QUALITY: [
444
+ 'extraction_score', 'data_accuracy', 'completeness_indicators'
445
+ ]
446
+ }
447
+
448
+ expected_fields = required_fields.get(analysis_type, [])
449
+
450
+ # Ensure all required fields are present with defaults
451
+ validated_response = response.copy()
452
+ for field in expected_fields:
453
+ if field not in validated_response:
454
+ validated_response[field] = self._get_default_field_value(field)
455
+
456
+ return validated_response
457
+
458
+ def _get_default_field_value(self, field_name: str) -> Any:
459
+ """Get default value for missing response fields"""
460
+
461
+ if field_name.endswith('_score'):
462
+ return 0
463
+ elif field_name in ['detailed_feedback', 'executive_summary']:
464
+ return "Analysis incomplete - field not provided"
465
+ elif field_name.endswith('_analysis') or field_name == 'document_outline':
466
+ return {}
467
+ elif field_name in ['main_topics', 'recommendations']:
468
+ return []
469
+ else:
470
+ return None
471
+
472
+ def _structure_unstructured_response(self, response_text: str, analysis_type: AnalysisType) -> JSONDict:
473
+ """Structure unstructured response text into expected format"""
474
+
475
+ # Basic structuring based on analysis type
476
+ base_structure = {
477
+ 'raw_response': response_text,
478
+ 'structured': False,
479
+ 'analysis_timestamp': datetime.now().isoformat()
480
+ }
481
+
482
+ # Add type-specific default structure
483
+ if analysis_type == AnalysisType.QUALITY_ANALYSIS:
484
+ base_structure.update({
485
+ 'overall_score': 5, # Neutral default
486
+ 'detailed_feedback': response_text,
487
+ 'recommendations': []
488
+ })
489
+ elif analysis_type == AnalysisType.CONTENT_SUMMARY:
490
+ base_structure.update({
491
+ 'executive_summary': response_text[:200] + "..." if len(response_text) > 200 else response_text,
492
+ 'content_quality': 5
493
+ })
494
+
495
+ return base_structure
496
+
497
+ async def batch_analyze(self, requests: List[AnalysisRequest]) -> List[AnalysisResponse]:
498
+ """Execute multiple analyses concurrently with rate limiting"""
499
+
500
+ # Implement concurrent processing with semaphore for rate limiting
501
+ semaphore = asyncio.Semaphore(3) # Max 3 concurrent requests
502
+
503
+ async def limited_analyze(request):
504
+ async with semaphore:
505
+ return await self.analyze_content(request)
506
+
507
+ # Execute all requests concurrently
508
+ tasks = [limited_analyze(request) for request in requests]
509
+ results = await asyncio.gather(*tasks, return_exceptions=True)
510
+
511
+ # Convert exceptions to error responses
512
+ processed_results = []
513
+ for i, result in enumerate(results):
514
+ if isinstance(result, Exception):
515
+ error_response = AnalysisResponse(
516
+ success=False,
517
+ analysis_type=requests[i].analysis_type,
518
+ model_used=requests[i].model,
519
+ content={},
520
+ metadata={'batch_error': True},
521
+ error_message=str(result)
522
+ )
523
+ processed_results.append(error_response)
524
+ else:
525
+ processed_results.append(result)
526
+
527
+ return processed_results
528
+
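Continuing the earlier sketch, a batch of three summaries processed under the 3-slot semaphore:

```python
# Sketch: assumes `engine` and the asyncio import from the earlier example.
batch = [
    AnalysisRequest(
        content=doc,
        analysis_type=AnalysisType.CONTENT_SUMMARY,
        model=GeminiModel.FLASH,
    )
    for doc in ("# A\n\nAlpha.", "# B\n\nBeta.", "# C\n\nGamma.")
]
responses = asyncio.run(engine.batch_analyze(batch))
print(sum(r.success for r in responses), "of", len(responses), "succeeded")
```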
529
+ def get_performance_metrics(self) -> JSONDict:
530
+ """Get comprehensive performance metrics"""
531
+
532
+ avg_processing_time = (
533
+ self.total_processing_time / self.request_count
534
+ if self.request_count > 0 else 0
535
+ )
536
+
537
+ success_rate = (
538
+ (self.request_count - self.error_count) / self.request_count * 100
539
+ if self.request_count > 0 else 0
540
+ )
541
+
542
+ return {
543
+ 'total_requests': self.request_count,
544
+ 'total_errors': self.error_count,
545
+ 'success_rate_percent': success_rate,
546
+ 'average_processing_time': avg_processing_time,
547
+ 'total_processing_time': self.total_processing_time,
548
+ 'requests_per_minute': self.request_count / max(self.total_processing_time / 60, 1)
549
+ }
550
+
551
+
552
+ class GeminiConnectionManager:
553
+ """
554
+ Enterprise-grade connection and configuration management for Gemini
555
+
556
+ Strategic Features:
557
+ - API key validation and secure storage
558
+ - Connection health monitoring
559
+ - Automatic reconnection and failover
560
+ - Usage tracking and optimization recommendations
561
+ """
562
+
563
+ def __init__(self):
564
+ self.engines: Dict[str, GeminiAnalysisEngine] = {}
565
+ self.connection_health = {}
566
+
567
+ async def create_engine(self, api_key: str, config: Optional[GeminiConfig] = None) -> str:
568
+ """Create and validate new Gemini engine instance"""
569
+
570
+ if not api_key or not api_key.strip():
571
+ raise ValueError("Valid API key is required")
572
+
573
+ # Create configuration
574
+ if config is None:
575
+ config = GeminiConfig(api_key=api_key)
576
+ else:
577
+ config.api_key = api_key
578
+
579
+ # Generate unique engine ID
580
+ engine_id = f"gemini_{hash(api_key) % 10000}"
581
+
582
+ try:
583
+ # Create and test engine
584
+ engine = GeminiAnalysisEngine(config)
585
+ await self._test_engine_connection(engine)
586
+
587
+ # Store engine and mark as healthy
588
+ self.engines[engine_id] = engine
589
+ self.connection_health[engine_id] = {
590
+ 'status': 'healthy',
591
+ 'last_check': datetime.now().isoformat(),
592
+ 'consecutive_failures': 0
593
+ }
594
+
595
+ logging.info(f"Gemini engine {engine_id} created and validated successfully")
596
+ return engine_id
597
+
598
+ except Exception as e:
599
+ logging.error(f"Failed to create Gemini engine: {e}")
600
+ raise
601
+
602
+ async def _test_engine_connection(self, engine: GeminiAnalysisEngine):
603
+ """Test engine connection with minimal request"""
604
+
605
+ test_request = AnalysisRequest(
606
+ content="# Test Document\n\nThis is a test.",
607
+ analysis_type=AnalysisType.CONTENT_SUMMARY,
608
+ model=GeminiModel.FLASH
609
+ )
610
+
611
+ response = await engine.analyze_content(test_request)
612
+ if not response.success:
613
+ raise RuntimeError(f"Engine connection test failed: {response.error_message}")
614
+
615
+ def get_engine(self, engine_id: str) -> Optional[GeminiAnalysisEngine]:
616
+ """Get engine instance by ID"""
617
+ return self.engines.get(engine_id)
618
+
619
+ def list_engines(self) -> Dict[str, JSONDict]:
620
+ """List all available engines with health status"""
621
+
622
+ result = {}
623
+ for engine_id, engine in self.engines.items():
624
+ health = self.connection_health.get(engine_id, {})
625
+ metrics = engine.get_performance_metrics()
626
+
627
+ result[engine_id] = {
628
+ 'health_status': health,
629
+ 'performance_metrics': metrics,
630
+ 'config': {
631
+ 'default_model': engine.config.default_model.value,
632
+ 'max_tokens': engine.config.max_tokens,
633
+ 'temperature': engine.config.temperature
634
+ }
635
+ }
636
+
637
+ return result
638
+
639
+ async def health_check_all(self) -> Dict[str, bool]:
640
+ """Perform health check on all engines"""
641
+
642
+ health_results = {}
643
+
644
+ for engine_id, engine in self.engines.items():
645
+ try:
646
+ await self._test_engine_connection(engine)
647
+ self.connection_health[engine_id].update({
648
+ 'status': 'healthy',
649
+ 'last_check': datetime.now().isoformat(),
650
+ 'consecutive_failures': 0
651
+ })
652
+ health_results[engine_id] = True
653
+
654
+ except Exception as e:
655
+ self.connection_health[engine_id]['consecutive_failures'] += 1
656
+ self.connection_health[engine_id]['status'] = 'unhealthy'
657
+ self.connection_health[engine_id]['last_error'] = str(e)
658
+ health_results[engine_id] = False
659
+
660
+ logging.warning(f"Health check failed for engine {engine_id}: {e}")
661
+
662
+ return health_results
663
+
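A lifecycle sketch for the manager; note that create_engine issues a real test request against the API:

```python
import asyncio
import os

async def _managed_demo() -> None:
    manager = GeminiConnectionManager()
    engine_id = await manager.create_engine(os.environ["GEMINI_API_KEY"])
    print(manager.list_engines()[engine_id]["health_status"])
    print(await manager.health_check_all())  # e.g. {'gemini_1234': True}

asyncio.run(_managed_demo())
```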
664
+
665
+ # Utility Functions for External Integration
666
+ def create_analysis_request(
667
+ content: str,
668
+ analysis_type: str,
669
+ model: str = "gemini-1.5-pro",
670
+ custom_instructions: Optional[str] = None
671
+ ) -> AnalysisRequest:
672
+ """Factory function for creating analysis requests"""
673
+
674
+ return AnalysisRequest(
675
+ content=content,
676
+ analysis_type=AnalysisType(analysis_type),
677
+ model=GeminiModel(model),
678
+ custom_instructions=custom_instructions
679
+ )
680
+
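A sketch of the string-based entry point; the "quality_analysis" string is our assumption about the AnalysisType enum's values, which are defined elsewhere in this module:

```python
# Hypothetical analysis_type string; AnalysisType(...) raises ValueError
# if it does not match an enum value.
request = create_analysis_request(
    content="# Report\n\nQuarterly figures...",
    analysis_type="quality_analysis",
    model="gemini-1.5-pro",
)
```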
681
+
682
+ def extract_key_insights(analysis_response: AnalysisResponse) -> JSONDict:
683
+ """Extract key insights from analysis response for UI display"""
684
+
685
+ if not analysis_response.success:
686
+ return {
687
+ 'error': True,
688
+ 'message': analysis_response.error_message,
689
+ 'analysis_type': analysis_response.analysis_type.value
690
+ }
691
+
692
+ content = analysis_response.content
693
+ insights = {
694
+ 'analysis_type': analysis_response.analysis_type.value,
695
+ 'model_used': analysis_response.model_used.value,
696
+ 'processing_time': analysis_response.processing_time,
697
+ 'success': True
698
+ }
699
+
700
+ # Extract type-specific insights
701
+ if analysis_response.analysis_type == AnalysisType.QUALITY_ANALYSIS:
702
+ insights.update({
703
+ 'overall_score': content.get('overall_score', 0),
704
+ 'key_scores': {
705
+ 'structure': content.get('structure_score', 0),
706
+ 'completeness': content.get('completeness_score', 0),
707
+ 'accuracy': content.get('accuracy_score', 0),
708
+ 'readability': content.get('readability_score', 0)
709
+ },
710
+ 'summary': content.get('detailed_feedback', '')[:200] + ('...' if len(content.get('detailed_feedback', '')) > 200 else '')
711
+ })
712
+
713
+ elif analysis_response.analysis_type == AnalysisType.CONTENT_SUMMARY:
714
+ insights.update({
715
+ 'summary': content.get('executive_summary', ''),
716
+ 'topics': content.get('main_topics', []),
717
+ 'quality_score': content.get('content_quality', 0)
718
+ })
719
+
720
+ return insights
721
+ JSONDict = Dict[str, JsonValue]  # NOTE: referenced in annotations above, so this alias must be defined before first use (module top)
requirements.txt ADDED
@@ -0,0 +1,43 @@
1
+ # MarkItDown Testing Platform - HF Spaces Optimized Dependencies
2
+ # Strategic dependency selection for enterprise-grade reliability
3
+
4
+ # Core Framework Dependencies
5
+ gradio>=4.0.0,<5.0.0 # UI framework - pinned major version for stability
6
+ markitdown[all]>=0.1.0 # Microsoft's document conversion engine
7
+
8
+ # LLM Integration - Gemini Focus
9
+ google-generativeai>=0.3.0,<1.0.0 # Google Gemini API client
10
+ google-auth>=2.0.0 # Authentication for Google services
11
+
12
+ # Data Processing & Visualization
13
+ plotly>=5.17.0,<6.0.0 # Interactive visualizations
14
+ pandas>=1.5.0,<3.0.0 # Data manipulation and analysis
15
+ numpy>=1.21.0,<2.0.0 # Numerical computing foundation
16
+
17
+ # Async Processing & File Handling
18
+ aiofiles>=22.0.0 # Async file operations
19
+ python-multipart>=0.0.6 # Multipart form data handling
20
+ async-timeout>=4.0.0 # Timeout management for async operations
21
+
22
+ # Image Processing (for multimodal capabilities)
23
+ Pillow>=9.0.0,<11.0.0 # Image processing library
24
+ python-magic>=0.4.27 # File type detection
25
+
26
+ # Utilities & Performance
27
+ pydantic>=2.0.0,<3.0.0 # Data validation and settings management
28
+ python-dotenv>=1.0.0 # Environment variable management
29
+ tenacity>=8.0.0 # Retry mechanisms for API calls
30
+
31
+ # Optional Dependencies for Advanced Features
32
+ openpyxl>=3.1.0 # Excel file processing
33
+ python-docx>=0.8.11 # Word document processing
34
+ PyPDF2>=3.0.0 # PDF processing backup
35
+
36
+ # Security & Monitoring (Production considerations)
37
+ cryptography>=41.0.0 # Secure API key handling
38
+ psutil>=5.9.0 # System resource monitoring
39
+
40
+ # Development & Testing Dependencies
41
+ pytest>=7.0.0 # Testing framework
42
+ black>=23.0.0 # Code formatting
43
+ flake8>=6.0.0 # Code linting
spaces_metadata.yaml ADDED
@@ -0,0 +1,77 @@
1
+ # Hugging Face Spaces Configuration
2
+ # MarkItDown Testing Platform Metadata
3
+
4
+ title: "MarkItDown Testing Platform"
5
+ emoji: "🚀"
6
+ colorFrom: "blue"
7
+ colorTo: "purple"
8
+ sdk: "gradio"
9
+ sdk_version: "4.0.0"
10
+ app_file: "app.py"
11
+ python_version: "3.10"
12
+
13
+ # Space configuration
14
+ models:
15
+ - google/gemini-pro
16
+ - microsoft/markitdown
17
+
18
+ datasets: []
19
+
20
+ # Space settings
21
+ pinned: false
22
+ license: "mit"
23
+ duplicated_from: null
24
+
25
+ # Hardware requirements (for paid tiers)
26
+ # hardware: "t4-medium" # Uncomment for GPU acceleration
27
+
28
+ # Environment variables (public - no secrets here)
29
+ variables:
30
+ GRADIO_THEME: "soft"
31
+ MAX_FILE_SIZE_MB: "50"
32
+ PROCESSING_TIMEOUT: "300"
33
+ APP_VERSION: "1.0.0"
34
+
35
+ # App metadata
36
+ short_description: "Enterprise-grade document conversion testing with AI-powered analysis using Microsoft MarkItDown and Google Gemini"
37
+
38
+ # Tags for discoverability
39
+ tags:
40
+ - document-processing
41
+ - ai-analysis
42
+ - markdown-conversion
43
+ - enterprise-tools
44
+ - quality-assessment
45
+ - microsoft-markitdown
46
+ - google-gemini
47
+ - document-conversion
48
+ - pdf-processing
49
+ - office-documents
50
+
51
+ # Custom configuration for the space
52
+ custom:
53
+ features:
54
+ - "Multi-format document conversion (PDF, DOCX, PPTX, XLSX, HTML, TXT, CSV, JSON, XML)"
55
+ - "AI-powered quality analysis with Google Gemini"
56
+ - "Interactive visualization dashboards"
57
+ - "Real-time processing metrics"
58
+ - "Export capabilities (Markdown, HTML, JSON, PDF)"
59
+ - "Enterprise-grade error handling and recovery"
60
+ - "Performance optimization and monitoring"
61
+
62
+ supported_formats:
63
+ documents: ["PDF", "DOCX", "PPTX", "XLSX"]
64
+ web: ["HTML", "HTM"]
65
+ text: ["TXT", "CSV", "JSON", "XML", "RTF"]
66
+
67
+ analysis_types:
68
+ - "Quality Analysis: Comprehensive conversion assessment"
69
+ - "Structure Review: Document hierarchy evaluation"
70
+ - "Content Summary: Thematic analysis and insights"
71
+ - "Extraction Quality: Data preservation assessment"
72
+
73
+ technical_specs:
74
+ max_file_size: "50MB (HF Spaces free tier)"
75
+ processing_timeout: "5 minutes"
76
+ memory_optimization: "Stateless architecture with automatic cleanup"
77
+ concurrent_processing: "Async pipeline with resource management"
utils/deployment.py ADDED
@@ -0,0 +1,609 @@
1
+ """
2
+ Deployment Utilities for MarkItDown Testing Platform
3
+
4
+ Strategic deployment tools for various environments:
5
+ - Hugging Face Spaces optimization
6
+ - Local development setup
7
+ - Production environment configuration
8
+ - Health monitoring and diagnostics
9
+ """
10
+
11
+ import os
12
+ import sys
13
+ import json
14
+ import logging
15
+ import platform
16
+ import subprocess
17
+ from datetime import datetime
18
+ from pathlib import Path
19
+ from typing import Dict, Any, List, Optional
20
+ import psutil
21
+
22
+ # Configure logging
23
+ logging.basicConfig(level=logging.INFO)
24
+ logger = logging.getLogger(__name__)
25
+
26
+
27
+ class EnvironmentDetector:
28
+ """Detect and configure for different deployment environments"""
29
+
30
+ @staticmethod
31
+ def detect_environment() -> str:
32
+ """Detect the current deployment environment"""
33
+
34
+ # Check for Hugging Face Spaces
35
+ if os.environ.get('SPACE_ID'):
36
+ return 'hf_spaces'
37
+
38
+ # Check for Docker environment
39
+ if os.path.exists('/.dockerenv'):
40
+ return 'docker'
41
+
42
+ # Check for common cloud providers
43
+ if os.environ.get('HEROKU_APP_NAME'):
44
+ return 'heroku'
45
+
46
+ if os.environ.get('AWS_EXECUTION_ENV'):
47
+ return 'aws'
48
+
49
+ if os.environ.get('GOOGLE_CLOUD_PROJECT'):
50
+ return 'gcp'
51
+
52
+ # Default to local development
53
+ return 'local'
54
+
55
+ @staticmethod
56
+ def get_environment_config(env_type: str) -> Dict[str, Any]:
57
+ """Get configuration for specific environment"""
58
+
59
+ configs = {
60
+ 'hf_spaces': {
61
+ 'max_file_size_mb': 50,
62
+ 'processing_timeout': 300,
63
+ 'max_memory_gb': 16,
64
+ 'temp_dir': '/tmp',
65
+ 'enable_analytics': True,
66
+ 'log_level': 'INFO',
67
+ 'gradio_config': {
68
+ 'server_name': '0.0.0.0',
69
+ 'server_port': 7860,
70
+ 'share': False,
71
+ 'enable_queue': True,
72
+ 'max_file_size': '50mb'
73
+ }
74
+ },
75
+ 'docker': {
76
+ 'max_file_size_mb': 100,
77
+ 'processing_timeout': 600,
78
+ 'max_memory_gb': 32,
79
+ 'temp_dir': '/tmp',
80
+ 'enable_analytics': True,
81
+ 'log_level': 'INFO',
82
+ 'gradio_config': {
83
+ 'server_name': '0.0.0.0',
84
+ 'server_port': int(os.environ.get('PORT', 7860)),
85
+ 'share': False,
86
+ 'enable_queue': True,
87
+ 'max_file_size': '100mb'
88
+ }
89
+ },
90
+ 'local': {
91
+ 'max_file_size_mb': 200,
92
+ 'processing_timeout': 900,
93
+ 'max_memory_gb': 64,
94
+ 'temp_dir': './temp',
95
+ 'enable_analytics': True,
96
+ 'log_level': 'DEBUG',
97
+ 'gradio_config': {
98
+ 'server_name': '127.0.0.1',
99
+ 'server_port': 7860,
100
+ 'share': True,
101
+ 'enable_queue': False,
102
+ 'max_file_size': '200mb'
103
+ }
104
+ }
105
+ }
106
+
107
+ return configs.get(env_type, configs['local'])
108
+
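A sketch of resolving the active configuration at startup:

```python
# Detect once at startup, then hand the gradio_config block to launch().
env_type = EnvironmentDetector.detect_environment()
config = EnvironmentDetector.get_environment_config(env_type)
print(env_type, config["gradio_config"]["server_port"])
```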
109
+
110
+ class SystemHealthChecker:
111
+ """System health monitoring and diagnostics"""
112
+
113
+ @staticmethod
114
+ def check_system_resources() -> Dict[str, Any]:
115
+ """Check system resource availability"""
116
+
117
+ try:
118
+ # Memory information
119
+ memory = psutil.virtual_memory()
120
+
121
+ # CPU information
122
+ cpu_count = psutil.cpu_count()
123
+ cpu_percent = psutil.cpu_percent(interval=1)
124
+
125
+ # Disk information
126
+ disk = psutil.disk_usage('/')
127
+
128
+ # System information
129
+ system_info = {
130
+ 'platform': platform.platform(),
131
+ 'python_version': platform.python_version(),
132
+ 'architecture': platform.architecture()[0]
133
+ }
134
+
135
+ return {
136
+ 'timestamp': datetime.now().isoformat(),
137
+ 'memory': {
138
+ 'total_gb': memory.total / (1024**3),
139
+ 'available_gb': memory.available / (1024**3),
140
+ 'used_percent': memory.percent,
141
+ 'free_gb': memory.free / (1024**3)
142
+ },
143
+ 'cpu': {
144
+ 'count': cpu_count,
145
+ 'usage_percent': cpu_percent,
146
+ 'load_average': os.getloadavg() if hasattr(os, 'getloadavg') else None
147
+ },
148
+ 'disk': {
149
+ 'total_gb': disk.total / (1024**3),
150
+ 'free_gb': disk.free / (1024**3),
151
+ 'used_percent': (disk.used / disk.total) * 100
152
+ },
153
+ 'system': system_info,
154
+ 'status': 'healthy' if memory.percent < 80 and cpu_percent < 80 else 'warning'
155
+ }
156
+
157
+ except Exception as e:
158
+ logger.error(f"Health check failed: {e}")
159
+ return {
160
+ 'timestamp': datetime.now().isoformat(),
161
+ 'status': 'error',
162
+ 'error': str(e)
163
+ }
164
+
165
+ @staticmethod
166
+ def check_dependencies() -> Dict[str, Any]:
167
+ """Check if all required dependencies are available"""
168
+
169
+ required_packages = [
170
+ 'gradio',
171
+ 'markitdown',
172
+ 'google-generativeai',
173
+ 'plotly',
174
+ 'pandas',
175
+ 'numpy',
176
+ 'aiofiles',
177
+ 'tenacity',
178
+ 'psutil',
179
+ 'magic'
180
+ ]
181
+
182
+ dependency_status = {}
183
+ all_available = True
184
+
185
+ for package in required_packages:
186
+ try:
187
+ # Map distribution names to import names where they differ
+ # (e.g. google-generativeai is imported as google.generativeai)
+ import_name = {
+ 'google-generativeai': 'google.generativeai',
+ }.get(package, package.replace('-', '_'))
+ __import__(import_name)
188
+ dependency_status[package] = {'available': True, 'error': None}
189
+ except ImportError as e:
190
+ dependency_status[package] = {'available': False, 'error': str(e)}
191
+ all_available = False
192
+
193
+ return {
194
+ 'timestamp': datetime.now().isoformat(),
195
+ 'all_dependencies_available': all_available,
196
+ 'packages': dependency_status,
197
+ 'status': 'ready' if all_available else 'missing_dependencies'
198
+ }
199
+
200
+ @staticmethod
201
+ def run_comprehensive_health_check() -> Dict[str, Any]:
202
+ """Run comprehensive system health check"""
203
+
204
+ logger.info("Starting comprehensive health check...")
205
+
206
+ # Detect environment
207
+ env_type = EnvironmentDetector.detect_environment()
208
+ env_config = EnvironmentDetector.get_environment_config(env_type)
209
+
210
+ # Check system resources
211
+ resource_check = SystemHealthChecker.check_system_resources()
212
+
213
+ # Check dependencies
214
+ dependency_check = SystemHealthChecker.check_dependencies()
215
+
216
+ # Overall health assessment
217
+ overall_status = 'healthy'
218
+ issues = []
219
+
220
+ if resource_check.get('status') != 'healthy':
221
+ overall_status = 'warning'
222
+ issues.append('System resources under pressure')
223
+
224
+ if not dependency_check.get('all_dependencies_available'):
225
+ overall_status = 'error'
226
+ issues.append('Missing required dependencies')
227
+
228
+ return {
229
+ 'timestamp': datetime.now().isoformat(),
230
+ 'environment': {
231
+ 'type': env_type,
232
+ 'config': env_config
233
+ },
234
+ 'system_resources': resource_check,
235
+ 'dependencies': dependency_check,
236
+ 'overall_status': overall_status,
237
+ 'issues': issues,
238
+ 'recommendations': SystemHealthChecker._generate_recommendations(
239
+ env_type, resource_check, dependency_check
240
+ )
241
+ }
242
+
243
+ @staticmethod
244
+ def _generate_recommendations(
245
+ env_type: str,
246
+ resource_check: Dict[str, Any],
247
+ dependency_check: Dict[str, Any]
248
+ ) -> List[str]:
249
+ """Generate recommendations based on health check results"""
250
+
251
+ recommendations = []
252
+
253
+ # Memory recommendations
254
+ memory_percent = resource_check.get('memory', {}).get('used_percent', 0)
255
+ if memory_percent > 80:
256
+ recommendations.append("High memory usage detected. Consider reducing file sizes or processing batch sizes.")
257
+
258
+ # CPU recommendations
259
+ cpu_percent = resource_check.get('cpu', {}).get('usage_percent', 0)
260
+ if cpu_percent > 80:
261
+ recommendations.append("High CPU usage detected. Consider enabling async processing or reducing concurrent operations.")
262
+
263
+ # Environment-specific recommendations
264
+ if env_type == 'hf_spaces':
265
+ recommendations.extend([
266
+ "Optimize for HF Spaces: Keep file sizes under 50MB",
267
+ "Use stateless processing to avoid memory leaks",
268
+ "Implement proper cleanup in temporary directories"
269
+ ])
270
+
271
+ # Dependency recommendations
272
+ if not dependency_check.get('all_dependencies_available'):
273
+ recommendations.append("Install missing dependencies using: pip install -r requirements.txt")
274
+
275
+ return recommendations
276
+
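A sketch of wiring the checks above into process startup:

```python
# Fail fast on missing dependencies; merely warn on resource pressure.
report = SystemHealthChecker.run_comprehensive_health_check()
if not report["dependencies"]["all_dependencies_available"]:
    raise SystemExit("Missing dependencies: " + "; ".join(report["issues"]))
for issue in report["issues"]:
    logger.warning("startup issue: %s", issue)
```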
277
+
278
+ class DeploymentConfigGenerator:
279
+ """Generate configuration files for different deployment environments"""
280
+
281
+ @staticmethod
282
+ def generate_hf_spaces_config() -> Dict[str, str]:
283
+ """Generate configuration files for HF Spaces"""
284
+
285
+ # README.md content
286
+ readme_content = """---
287
+ title: MarkItDown Testing Platform
288
+ emoji: 🚀
289
+ colorFrom: blue
290
+ colorTo: purple
291
+ sdk: gradio
292
+ sdk_version: 4.0.0
293
+ app_file: app.py
294
+ pinned: false
295
+ ---
296
+
297
+ # MarkItDown Testing Platform
298
+
299
+ Enterprise-grade document conversion testing with AI-powered analysis.
300
+
301
+ ## Features
302
+ - Multi-format document conversion
303
+ - Google Gemini AI analysis
304
+ - Interactive dashboards
305
+ - Quality metrics and reporting
306
+
307
+ ## Usage
308
+ 1. Upload your document
309
+ 2. Configure analysis settings
310
+ 3. Enter Gemini API key (optional)
311
+ 4. Process and analyze results
312
+ """
313
+
314
+ # Dockerfile content
315
+ dockerfile_content = """FROM python:3.10-slim
316
+
317
+ ENV PYTHONUNBUFFERED=1
318
+ WORKDIR /app
319
+
320
+ RUN apt-get update && apt-get install -y gcc g++ libmagic1 libmagic-dev
321
+
322
+ COPY requirements.txt .
323
+ RUN pip install -r requirements.txt
324
+
325
+ COPY . .
326
+
327
+ EXPOSE 7860
328
+
329
+ CMD ["python", "app.py"]
330
+ """
331
+
332
+ return {
333
+ 'README.md': readme_content,
334
+ 'Dockerfile': dockerfile_content
335
+ }
336
+
337
+ @staticmethod
338
+ def save_deployment_configs(output_dir: str = "."):
339
+ """Save all deployment configuration files"""
340
+
341
+ output_path = Path(output_dir)
342
+ output_path.mkdir(exist_ok=True)
343
+
344
+ # Generate HF Spaces configs
345
+ hf_configs = DeploymentConfigGenerator.generate_hf_spaces_config()
346
+
347
+ for filename, content in hf_configs.items():
348
+ file_path = output_path / filename
349
+ with open(file_path, 'w', encoding='utf-8') as f:
350
+ f.write(content)
351
+
352
+ logger.info(f"Generated {filename} in {output_dir}")
353
+
354
+ logger.info("All deployment configurations generated successfully")
355
+
356
+
357
+ class DeploymentValidator:
358
+ """Validate deployment readiness"""
359
+
360
+ @staticmethod
361
+ def validate_for_hf_spaces() -> Dict[str, Any]:
362
+ """Validate configuration for HF Spaces deployment"""
363
+
364
+ validation_results = {
365
+ 'timestamp': datetime.now().isoformat(),
366
+ 'environment': 'hf_spaces',
367
+ 'checks': {},
368
+ 'overall_status': 'ready',
369
+ 'issues': []
370
+ }
371
+
372
+ # Check required files
373
+ required_files = ['app.py', 'requirements.txt', 'README.md']
374
+ for file in required_files:
375
+ if os.path.exists(file):
376
+ validation_results['checks'][f'{file}_exists'] = True
377
+ else:
378
+ validation_results['checks'][f'{file}_exists'] = False
379
+ validation_results['issues'].append(f"Missing required file: {file}")
380
+ validation_results['overall_status'] = 'error'
381
+
382
+ # Check app.py structure
383
+ if os.path.exists('app.py'):
384
+ try:
385
+ with open('app.py', 'r') as f:
386
+ content = f.read()
387
+
388
+ # Check for required components
389
+ required_components = [
390
+ 'gradio',
391
+ 'launch',
392
+ 'if __name__ == "__main__"'
393
+ ]
394
+
395
+ for component in required_components:
396
+ if component in content:
397
+ validation_results['checks'][f'app_{component}'] = True
398
+ else:
399
+ validation_results['checks'][f'app_{component}'] = False
400
+ validation_results['issues'].append(f"Missing component in app.py: {component}")
401
+ validation_results['overall_status'] = 'warning'
402
+
403
+ except Exception as e:
404
+ validation_results['checks']['app_readable'] = False
405
+ validation_results['issues'].append(f"Cannot read app.py: {e}")
406
+ validation_results['overall_status'] = 'error'
407
+
408
+ # Check requirements.txt
409
+ if os.path.exists('requirements.txt'):
410
+ try:
411
+ with open('requirements.txt', 'r') as f:
412
+ requirements = f.read()
413
+
414
+ # Check for essential packages
415
+ essential_packages = ['gradio', 'markitdown', 'google-generativeai']
416
+ for package in essential_packages:
417
+ if package in requirements:
418
+ validation_results['checks'][f'req_{package}'] = True
419
+ else:
420
+ validation_results['checks'][f'req_{package}'] = False
421
+ validation_results['issues'].append(f"Missing package in requirements.txt: {package}")
422
+ validation_results['overall_status'] = 'warning'
423
+
424
+ except Exception as e:
425
+ validation_results['checks']['requirements_readable'] = False
426
+ validation_results['issues'].append(f"Cannot read requirements.txt: {e}")
427
+ validation_results['overall_status'] = 'error'
428
+
429
+ # Check file sizes (HF Spaces has limits)
430
+ total_size = 0
431
+ for root, dirs, files in os.walk('.'):
432
+ for file in files:
433
+ file_path = os.path.join(root, file)
434
+ if os.path.exists(file_path):
435
+ total_size += os.path.getsize(file_path)
436
+
437
+ total_size_mb = total_size / (1024 * 1024)
438
+ validation_results['checks']['total_size_mb'] = total_size_mb
439
+
440
+ if total_size_mb > 500: # HF Spaces limit
441
+ validation_results['issues'].append(f"Total size ({total_size_mb:.2f}MB) exceeds HF Spaces limit")
442
+ validation_results['overall_status'] = 'error'
443
+
444
+ return validation_results
445
+
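A sketch of using the validator as a release gate:

```python
# Block the deployment script when validation reports an error.
result = DeploymentValidator.validate_for_hf_spaces()
if result["overall_status"] == "error":
    raise SystemExit("Deployment blocked: " + "; ".join(result["issues"]))
print(f"Repository size: {result['checks']['total_size_mb']:.1f}MB")
```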
446
+ @staticmethod
447
+ def generate_deployment_report() -> str:
448
+ """Generate comprehensive deployment readiness report"""
449
+
450
+ # Run health check
451
+ health_check = SystemHealthChecker.run_comprehensive_health_check()
452
+
453
+ # Run HF Spaces validation
454
+ hf_validation = DeploymentValidator.validate_for_hf_spaces()
455
+
456
+ # Generate report
457
+ report = f"""
458
+ # MarkItDown Testing Platform - Deployment Report
459
+ Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
460
+
461
+ ## Environment Information
462
+ - **Type**: {health_check['environment']['type']}
463
+ - **Platform**: {health_check['system_resources']['system']['platform']}
464
+ - **Python**: {health_check['system_resources']['system']['python_version']}
465
+
466
+ ## System Health Status: {health_check['overall_status'].upper()}
467
+
468
+ ### System Resources
469
+ - **Memory**: {health_check['system_resources']['memory']['available_gb']:.2f}GB available ({health_check['system_resources']['memory']['used_percent']:.1f}% used)
470
+ - **CPU**: {health_check['system_resources']['cpu']['count']} cores, {health_check['system_resources']['cpu']['usage_percent']:.1f}% usage
471
+ - **Disk**: {health_check['system_resources']['disk']['free_gb']:.2f}GB free
472
+
473
+ ### Dependencies Status: {"✅ READY" if health_check['dependencies']['all_dependencies_available'] else "❌ MISSING"}
474
+ """
475
+
476
+ # Add dependency details
477
+ for package, status in health_check['dependencies']['packages'].items():
478
+ status_icon = "✅" if status['available'] else "❌"
479
+ report += f"- {status_icon} {package}\n"
480
+
481
+ # Add HF Spaces validation
482
+ report += f"""
483
+ ## HF Spaces Deployment Readiness: {hf_validation['overall_status'].upper()}
484
+
485
+ ### File Validation
486
+ """
487
+
488
+ for check, result in hf_validation['checks'].items():
489
+ status_icon = "✅" if result else "❌"
490
+ report += f"- {status_icon} {check}\n"
491
+
492
+ # Add issues and recommendations
493
+ if health_check['issues']:
494
+ report += "\n### Issues Identified:\n"
495
+ for issue in health_check['issues']:
496
+ report += f"- ⚠️ {issue}\n"
497
+
498
+ if hf_validation['issues']:
499
+ report += "\n### HF Spaces Issues:\n"
500
+ for issue in hf_validation['issues']:
501
+ report += f"- ⚠️ {issue}\n"
502
+
503
+ if health_check['recommendations']:
504
+ report += "\n### Recommendations:\n"
505
+ for rec in health_check['recommendations']:
506
+ report += f"- 💡 {rec}\n"
507
+
508
+ # Add deployment commands
509
+ report += f"""
510
+ ## Deployment Commands
511
+
512
+ ### Local Development
513
+ ```bash
514
+ python app.py
515
+ ```
516
+
517
+ ### Docker Deployment
518
+ ```bash
519
+ docker build -t markitdown-platform .
520
+ docker run -p 7860:7860 markitdown-platform
521
+ ```
522
+
523
+ ### HF Spaces Deployment
524
+ 1. Create new Space on Hugging Face
525
+ 2. Upload files or connect GitHub repository
526
+ 3. Configure Space settings:
527
+ - SDK: Gradio
528
+ - Python version: 3.10
529
+ - Hardware: CPU (free tier)
530
+
531
+ ---
532
+ Report generated by MarkItDown Testing Platform Deployment Utils
533
+ """
534
+
535
+ return report
536
+
537
+
538
+ def main():
539
+ """Main function for deployment utilities CLI"""
540
+
541
+ import argparse
542
+
543
+ parser = argparse.ArgumentParser(description='MarkItDown Platform Deployment Utilities')
544
+ parser.add_argument(
545
+ 'command',
546
+ choices=['health-check', 'validate', 'generate-configs', 'report'],
547
+ help='Command to execute'
548
+ )
549
+ parser.add_argument(
550
+ '--output',
551
+ default='.',
552
+ help='Output directory for generated files'
553
+ )
554
+ parser.add_argument(
555
+ '--format',
556
+ choices=['json', 'text'],
557
+ default='text',
558
+ help='Output format'
559
+ )
560
+
561
+ args = parser.parse_args()
562
+
563
+ if args.command == 'health-check':
564
+ result = SystemHealthChecker.run_comprehensive_health_check()
565
+ if args.format == 'json':
566
+ print(json.dumps(result, indent=2))
567
+ else:
568
+ print(f"System Status: {result['overall_status'].upper()}")
569
+ print(f"Environment: {result['environment']['type']}")
570
+ print(f"Memory Available: {result['system_resources']['memory']['available_gb']:.2f}GB")
571
+ print(f"Dependencies: {'OK' if result['dependencies']['all_dependencies_available'] else 'MISSING'}")
572
+
573
+ if result['issues']:
574
+ print("\nIssues:")
575
+ for issue in result['issues']:
576
+ print(f" - {issue}")
577
+
578
+ elif args.command == 'validate':
579
+ result = DeploymentValidator.validate_for_hf_spaces()
580
+ if args.format == 'json':
581
+ print(json.dumps(result, indent=2))
582
+ else:
583
+ print(f"HF Spaces Validation: {result['overall_status'].upper()}")
584
+ if result['issues']:
585
+ print("Issues found:")
586
+ for issue in result['issues']:
587
+ print(f" - {issue}")
588
+ else:
589
+ print("✅ Ready for HF Spaces deployment!")
590
+
591
+ elif args.command == 'generate-configs':
592
+ DeploymentConfigGenerator.save_deployment_configs(args.output)
593
+ print(f"Configuration files generated in {args.output}")
594
+
595
+ elif args.command == 'report':
596
+ report = DeploymentValidator.generate_deployment_report()
597
+
598
+ if args.output != '.':
599
+ os.makedirs(args.output, exist_ok=True)
600
+ report_file = os.path.join(args.output, 'deployment_report.md')
601
+ with open(report_file, 'w') as f:
602
+ f.write(report)
603
+ print(f"Deployment report saved to {report_file}")
604
+ else:
605
+ print(report)
606
+
607
+
608
+ if __name__ == "__main__":
609
+ main()
visualization/analytics_engine.py ADDED
@@ -0,0 +1,1393 @@
1
+ """
2
+ Enterprise Visualization Architecture - Strategic Refactoring Implementation
3
+
4
+ Core Design Philosophy:
5
+ "Complexity is the enemy of reliable software"
6
+
7
+ Architectural Principles Applied:
8
+ - Single Responsibility: Each component handles one concern
9
+ - Dependency Inversion: Abstract interfaces eliminate tight coupling
10
+ - Human-Scale Modularity: Components fit in developer working memory
11
+ - Testable Design: Every component can be unit tested independently
12
+
13
+ Strategic Benefits:
14
+ - Maintainability: Clear component boundaries enable team collaboration
15
+ - Extensibility: Plugin architecture supports future requirements
16
+ - Performance: Optimized algorithms with caching strategies
17
+ - Reliability: Comprehensive error boundaries with graceful degradation
18
+ """
19
+
20
+ import logging
21
+ from abc import ABC, abstractmethod
22
+ from dataclasses import dataclass, field
23
+ from datetime import datetime
24
+ from typing import Dict, Any, List, Optional, Tuple, Union, Protocol
25
+ from enum import Enum
26
+ import json
27
+ from pydantic import JsonValue
28
+
29
+ JSONDict = Dict[str, JsonValue]
30
+
31
+ # Strategic import approach - minimal external dependencies
32
+ import plotly.graph_objects as go
33
+ import plotly.express as px
34
+ from plotly.subplots import make_subplots
35
+ import pandas as pd
36
+ import numpy as np
37
+
38
+ # Configure enterprise logging
39
+ logger = logging.getLogger(__name__)
40
+
41
+
42
+ # ==================== STRATEGIC DATA ABSTRACTIONS ====================
43
+
44
+ @dataclass(frozen=True)
45
+ class DocumentAnalysisData:
46
+ """
47
+ Immutable data container - eliminates circular import dependencies
48
+
49
+ Strategic Design:
50
+ - Frozen dataclass ensures immutability
51
+ - Self-contained data eliminates external module coupling
52
+ - Clear interface enables component testing
53
+ """
54
+ content: str
55
+ metadata: JSONDict
56
+ processing_metrics: JSONDict = field(default_factory=dict)
57
+ ai_analysis_data: Optional[JSONDict] = None
58
+
59
+ @classmethod
60
+ def from_processing_result(cls, conversion_result, analysis_result=None) -> 'DocumentAnalysisData':
61
+ """Factory method for creating from external processing results"""
62
+
63
+ # Extract content and metadata safely
64
+ content = getattr(conversion_result, 'content', '') or ''
65
+ metadata = getattr(conversion_result, 'metadata', {}) or {}
66
+
67
+ # Extract processing metrics
68
+ processing_metrics = {
69
+ 'processing_time': getattr(conversion_result, 'processing_time', 0),
70
+ 'success': getattr(conversion_result, 'success', False),
71
+ 'content_length': len(content)
72
+ }
73
+
74
+ # Extract AI analysis data if available
75
+ ai_data = None
76
+ if analysis_result and hasattr(analysis_result, 'success') and analysis_result.success:
77
+ ai_data = {
78
+ 'analysis_type': getattr(analysis_result, 'analysis_type', None),
79
+ 'model_used': getattr(analysis_result, 'model_used', None),
80
+ 'content': getattr(analysis_result, 'content', {}),
81
+ 'processing_time': getattr(analysis_result, 'processing_time', 0)
82
+ }
83
+
84
+ return cls(
85
+ content=content,
86
+ metadata=metadata,
87
+ processing_metrics=processing_metrics,
88
+ ai_analysis_data=ai_data
89
+ )
90
+
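A sketch of the factory with a stand-in result object; the attribute names mirror the getattr calls above:

```python
from types import SimpleNamespace

# Stand-in for a real conversion result; only duck typing is required.
stub = SimpleNamespace(content="# Title\n\nBody", metadata={}, processing_time=0.4, success=True)
data = DocumentAnalysisData.from_processing_result(stub)
print(data.processing_metrics)  # {'processing_time': 0.4, 'success': True, 'content_length': 13}
```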
91
+
92
+ @dataclass(frozen=True)
93
+ class StructuralMetrics:
94
+ """Immutable container for document structural analysis"""
95
+ header_count: int = 0
96
+ list_items: int = 0
97
+ table_rows: int = 0
98
+ code_blocks: int = 0
99
+ links: int = 0
100
+ max_header_depth: int = 0
101
+ structure_density: float = 0.0
102
+
103
+ def to_dict(self) -> JSONDict:
104
+ """Convert to dictionary for external consumption"""
105
+ return {
106
+ 'header_count': self.header_count,
107
+ 'list_items': self.list_items,
108
+ 'table_rows': self.table_rows,
109
+ 'code_blocks': self.code_blocks,
110
+ 'links': self.links,
111
+ 'max_header_depth': self.max_header_depth,
112
+ 'structure_density': self.structure_density
113
+ }
114
+
115
+
116
+ @dataclass(frozen=True)
117
+ class QualityAssessment:
118
+ """Comprehensive quality metrics container"""
119
+ composite_score: float = 0.0
120
+ structural_score: float = 0.0
121
+ content_score: float = 0.0
122
+ ai_score: float = 0.0
123
+ performance_score: float = 0.0
124
+
125
+ def to_dict(self) -> JSONDict:
126
+ return {
127
+ 'composite_score': self.composite_score,
128
+ 'structural_score': self.structural_score,
129
+ 'content_score': self.content_score,
130
+ 'ai_score': self.ai_score,
131
+ 'performance_score': self.performance_score
132
+ }
133
+
134
+
135
+ @dataclass(frozen=True)
136
+ class VisualizationRequest:
137
+ """Request abstraction for visualization generation"""
138
+ analysis_data: DocumentAnalysisData
139
+ chart_type: str
140
+ configuration: JSONDict = field(default_factory=dict)
141
+ theme: str = 'plotly_white'
142
+ dimensions: Tuple[int, int] = (800, 600)
143
+
144
+
145
+ # ==================== COMPONENT INTERFACES ====================
146
+
147
+ class ContentAnalyzer(Protocol):
148
+ """Interface for content analysis components"""
149
+
150
+ def analyze_structure(self, content: str) -> StructuralMetrics:
151
+ """Analyze document structural elements"""
152
+ ...
153
+
154
+ def calculate_quality_metrics(self, analysis_data: DocumentAnalysisData) -> QualityAssessment:
155
+ """Calculate comprehensive quality assessment"""
156
+ ...
157
+
158
+
159
+ class ChartRenderer(Protocol):
160
+ """Interface for chart generation components"""
161
+
162
+ def render_radar_chart(self, data: Dict[str, float], **kwargs) -> go.Figure:
163
+ """Render radar/polar chart"""
164
+ ...
165
+
166
+ def render_bar_chart(self, data: Dict[str, float], **kwargs) -> go.Figure:
167
+ """Render bar chart"""
168
+ ...
169
+
170
+ def render_treemap(self, data: Dict[str, Any], **kwargs) -> go.Figure:
171
+ """Render treemap visualization"""
172
+ ...
173
+
174
+
175
+ class DashboardComposer(Protocol):
176
+ """Interface for dashboard composition"""
177
+
178
+ def compose_quality_dashboard(
179
+ self,
180
+ quality_metrics: QualityAssessment,
181
+ structural_metrics: StructuralMetrics,
182
+ **kwargs
183
+ ) -> go.Figure:
184
+ """Compose comprehensive quality dashboard"""
185
+ ...
186
+
187
+
188
+ # ==================== CORE IMPLEMENTATION COMPONENTS ====================
189
+
190
+ class OptimizedContentAnalyzer:
191
+ """
192
+ High-performance content analysis with single-pass parsing
193
+
194
+ Strategic Design:
195
+ - Single Responsibility: Content analysis only
196
+ - Performance Optimized: O(n) complexity for all operations
197
+ - Memory Efficient: Minimal object allocation during parsing
198
+ - Error Resilient: Handles malformed content gracefully
199
+ """
200
+
201
+ def __init__(self):
202
+ self._analysis_cache: Dict[str, StructuralMetrics] = {}
203
+ self._cache_hit_count = 0
204
+ self._cache_miss_count = 0
205
+
206
+ def analyze_structure(self, content: str) -> StructuralMetrics:
207
+ """
208
+ Single-pass structural analysis with caching
209
+
210
+ Performance Strategy:
211
+ - Cache results by content hash for identical documents
212
+ - Single iteration through content lines
213
+ - Efficient pattern matching with early termination
214
+ """
215
+
216
+ # Generate cache key from content hash
217
+ import hashlib
218
+ content_hash = hashlib.md5(content.encode()).hexdigest()
219
+
220
+ # Check cache first
221
+ if content_hash in self._analysis_cache:
222
+ self._cache_hit_count += 1
223
+ logger.debug(f"Cache hit for content analysis - {self._cache_hit_count} hits")
224
+ return self._analysis_cache[content_hash]
225
+
226
+ self._cache_miss_count += 1
227
+ logger.debug(f"Cache miss - analyzing content structure")
228
+
229
+ # Single-pass analysis
230
+ lines = content.split('\n')
231
+ total_lines = len(lines)
232
+
233
+ header_count = 0
234
+ list_items = 0
235
+ table_rows = 0
236
+ code_blocks = 0
237
+ links = 0
238
+ max_header_depth = 0
239
+ structural_elements = 0
240
+
241
+ in_code_block = False
242
+
243
+ for line in lines:
244
+ stripped_line = line.strip()
245
+
246
+ # Skip empty lines
247
+ if not stripped_line:
248
+ continue
249
+
250
+ # Code block detection
251
+ if stripped_line.startswith('```'):
252
+ if in_code_block:
253
+ code_blocks += 1
254
+ in_code_block = not in_code_block
255
+ structural_elements += 1
256
+ continue
257
+
258
+ # Skip analysis inside code blocks
259
+ if in_code_block:
260
+ continue
261
+
262
+ # Header analysis
263
+ if stripped_line.startswith('#'):
264
+ header_level = len(stripped_line) - len(stripped_line.lstrip('#'))
265
+ header_count += 1
266
+ max_header_depth = max(max_header_depth, header_level)
267
+ structural_elements += 1
268
+ continue
269
+
270
+ # List item analysis
271
+ if stripped_line.startswith(('- ', '* ', '+ ')) or (
272
+ len(stripped_line) > 2 and
273
+ stripped_line[0].isdigit() and
274
+ stripped_line[1:3] == '. '
275
+ ):
276
+ list_items += 1
277
+ structural_elements += 1
278
+ continue
279
+
280
+ # Table row analysis
281
+ if '|' in stripped_line and stripped_line.count('|') >= 2:
282
+ table_rows += 1
283
+ structural_elements += 1
284
+
285
+ # Link analysis (can coexist with other elements)
286
+ links += stripped_line.count('](')
287
+
288
+ # Calculate structure density
289
+ structure_density = structural_elements / total_lines if total_lines > 0 else 0.0
290
+
291
+ # Create metrics object
292
+ metrics = StructuralMetrics(
293
+ header_count=header_count,
294
+ list_items=list_items,
295
+ table_rows=table_rows,
296
+ code_blocks=code_blocks,
297
+ links=links,
298
+ max_header_depth=max_header_depth,
299
+ structure_density=structure_density
300
+ )
301
+
302
+ # Cache the result
303
+ self._analysis_cache[content_hash] = metrics
304
+
305
+ return metrics
306
+
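A quick sketch of the single-pass analyzer on a small document:

```python
analyzer = OptimizedContentAnalyzer()
doc = "# Heading\n\n- item one\n- item two\n\n| a | b |\n"
metrics = analyzer.analyze_structure(doc)
print(metrics.header_count, metrics.list_items, metrics.table_rows)  # 1 2 1
print(analyzer.get_cache_statistics()["cache_misses"])  # 1
```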
307
+ def calculate_quality_metrics(self, analysis_data: DocumentAnalysisData) -> QualityAssessment:
308
+ """
309
+ Comprehensive quality assessment with weighted scoring
310
+
311
+ Strategic Approach:
312
+ - Multiple quality dimensions with configurable weights
313
+ - AI analysis integration when available
314
+ - Performance metrics consideration
315
+ - Normalized scoring (0-10 scale)
316
+ """
317
+
318
+ # Analyze document structure
319
+ structural_metrics = self.analyze_structure(analysis_data.content)
320
+
321
+ # Calculate structural quality score (0-10)
322
+ structural_score = min(10.0, (
323
+ (structural_metrics.header_count * 1.0) +
324
+ (structural_metrics.list_items * 0.5) +
325
+ (structural_metrics.table_rows * 0.8) +
326
+ (structural_metrics.code_blocks * 0.6) +
327
+ (structural_metrics.links * 0.3) +
328
+ (structural_metrics.structure_density * 10.0)
329
+ ))
330
+
331
+ # Calculate content quality score
332
+ content_length = len(analysis_data.content)
333
+ word_count = len(analysis_data.content.split()) if analysis_data.content else 0
334
+
335
+ content_score = min(10.0, (
336
+ (min(content_length / 1000, 5.0)) + # Length factor (up to 5 points)
337
+ (min(word_count / 200, 3.0)) + # Word density (up to 3 points)
338
+ (2.0 if structural_metrics.structure_density > 0.1 else 0.0) # Structure bonus
339
+ ))
340
+
341
+ # AI analysis score integration
342
+ ai_score = 0.0
343
+ if analysis_data.ai_analysis_data:
344
+ ai_content = analysis_data.ai_analysis_data.get('content', {})
345
+ ai_score = ai_content.get('overall_score', 0.0)
346
+
347
+ # Fallback calculation if no overall score
348
+ if ai_score == 0.0:
349
+ ai_score = (
350
+ ai_content.get('structure_score', 0.0) +
351
+ ai_content.get('completeness_score', 0.0) +
352
+ ai_content.get('accuracy_score', 0.0) +
353
+ ai_content.get('readability_score', 0.0)
354
+ ) / 4.0
355
+
356
+ # Performance score
357
+ processing_time = analysis_data.processing_metrics.get('processing_time', 0)
358
+ performance_score = max(0.0, min(10.0, 10.0 - (processing_time * 0.1)))
359
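+ # e.g. processing_time = 12 s gives 10.0 - 1.2 = 8.8; runs over 100 s clamp to 0.0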
+
360
+ # Composite score calculation with weights
361
+ weights = {
362
+ 'structural': 0.3,
363
+ 'content': 0.25,
364
+ 'ai': 0.3,
365
+ 'performance': 0.15
366
+ }
367
+
368
+ # Adjust weights if AI analysis is not available
369
+ if ai_score == 0.0:
370
+ weights = {
371
+ 'structural': 0.45,
372
+ 'content': 0.35,
373
+ 'ai': 0.0,
374
+ 'performance': 0.2
375
+ }
376
+
377
+ composite_score = (
378
+ structural_score * weights['structural'] +
379
+ content_score * weights['content'] +
380
+ ai_score * weights['ai'] +
381
+ performance_score * weights['performance']
382
+ )
383
+
384
+ return QualityAssessment(
385
+ composite_score=round(composite_score, 2),
386
+ structural_score=round(structural_score, 2),
387
+ content_score=round(content_score, 2),
388
+ ai_score=round(ai_score, 2),
389
+ performance_score=round(performance_score, 2)
390
+ )
391
+
392
+ def get_cache_statistics(self) -> JSONDict:
393
+ """Get cache performance statistics"""
394
+ total_requests = self._cache_hit_count + self._cache_miss_count
395
+ hit_rate = self._cache_hit_count / total_requests if total_requests > 0 else 0.0
396
+
397
+ return {
398
+ 'cache_hits': self._cache_hit_count,
399
+ 'cache_misses': self._cache_miss_count,
400
+ 'hit_rate_percent': hit_rate * 100,
401
+ 'cache_size': len(self._analysis_cache)
402
+ }
403
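+
+ # Note: _analysis_cache grows without bound in long-lived processes. A minimal
+ # size-capped replacement (illustrative sketch, not wired into the class above):
+ #
+ #   from collections import OrderedDict
+ #
+ #   class BoundedCache(OrderedDict):
+ #       def __init__(self, maxsize: int = 256):
+ #           super().__init__()
+ #           self.maxsize = maxsize
+ #
+ #       def __setitem__(self, key, value):
+ #           super().__setitem__(key, value)
+ #           if len(self) > self.maxsize:
+ #               self.popitem(last=False)  # evict oldest insertion (FIFO)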
+
404
+
405
+ class PlotlyChartRenderer:
406
+ """
407
+ Professional chart rendering with consistent styling
408
+
409
+ Strategic Design:
410
+ - Single Responsibility: Chart generation only
411
+ - Consistent Theming: Enterprise-appropriate visual standards
412
+ - Performance Optimized: Efficient Plotly figure generation
413
+ - Accessibility Compliant: Color-blind friendly palettes
414
+ """
415
+
416
+ def __init__(self, theme: str = 'plotly_white'):
417
+ self.theme = theme
418
+ self.color_palette = [
419
+ '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
420
+ '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf'
421
+ ]
422
+ self.enterprise_colors = {
423
+ 'primary': '#667eea',
424
+ 'secondary': '#764ba2',
425
+ 'success': '#28a745',
426
+ 'warning': '#ffc107',
427
+ 'danger': '#dc3545',
428
+ 'info': '#17a2b8'
429
+ }
430
+
431
+ def render_radar_chart(self, data: Dict[str, float], **kwargs) -> go.Figure:
432
+ """
433
+ Professional radar chart with enterprise styling
434
+
435
+ Strategic Features:
436
+ - Consistent color scheme
437
+ - Responsive design
438
+ - Clear labeling and legends
439
+ - Accessibility compliance
440
+ """
441
+
442
+ title = kwargs.get('title', 'Quality Assessment Radar')
443
+ categories = list(data.keys())
444
+ values = list(data.values())
445
+
446
+ fig = go.Figure()
447
+
448
+ fig.add_trace(go.Scatterpolar(
449
+ r=values,
450
+ theta=categories,
451
+ fill='toself',
452
+ name='Quality Metrics',
453
+ line=dict(color=self.enterprise_colors['primary'], width=3),
454
+ fillcolor="rgba(102, 126, 234, 0.3)"
455
+ ))
456
+
457
+ fig.update_layout(
458
+ polar=dict(
459
+ radialaxis=dict(
460
+ visible=True,
461
+ range=[0, 10],
462
+ tickfont=dict(size=12),
463
+ gridcolor='rgba(128, 128, 128, 0.3)'
464
+ ),
465
+ angularaxis=dict(
466
+ tickfont=dict(size=12, color='#333333')
467
+ )
468
+ ),
469
+ title=dict(
470
+ text=title,
471
+ x=0.5,
472
+ font=dict(size=16, color='#333333')
473
+ ),
474
+ template=self.theme,
475
+ showlegend=False,
476
+ width=kwargs.get('width', 600),
477
+ height=kwargs.get('height', 600)
478
+ )
479
+
480
+ return fig
481
+
482
+ def render_bar_chart(self, data: Dict[str, float], **kwargs) -> go.Figure:
483
+ """Professional bar chart with enterprise styling"""
484
+
485
+ title = kwargs.get('title', 'Metrics Comparison')
486
+ orientation = kwargs.get('orientation', 'v') # 'v' for vertical, 'h' for horizontal
487
+
488
+ categories = list(data.keys())
489
+ values = list(data.values())
490
+
491
+ # Color mapping based on values
492
+ colors = []
493
+ for value in values:
494
+ if value >= 8:
495
+ colors.append(self.enterprise_colors['success'])
496
+ elif value >= 6:
497
+ colors.append(self.enterprise_colors['info'])
498
+ elif value >= 4:
499
+ colors.append(self.enterprise_colors['warning'])
500
+ else:
501
+ colors.append(self.enterprise_colors['danger'])
502
+
503
+ fig = go.Figure()
504
+
505
+ if orientation == 'h':
506
+ fig.add_trace(go.Bar(
507
+ x=values,
508
+ y=categories,
509
+ orientation='h',
510
+ marker=dict(color=colors),
511
+ text=[f'{v:.1f}' for v in values],
512
+ textposition='inside',
513
+ textfont=dict(color='white', size=12)
514
+ ))
515
+ else:
516
+ fig.add_trace(go.Bar(
517
+ x=categories,
518
+ y=values,
519
+ marker=dict(color=colors),
520
+ text=[f'{v:.1f}' for v in values],
521
+ textposition='outside',
522
+ textfont=dict(color='#333333', size=12)
523
+ ))
524
+
525
+ fig.update_layout(
526
+ title=dict(
527
+ text=title,
528
+ x=0.5,
529
+ font=dict(size=16, color='#333333')
530
+ ),
531
+ template=self.theme,
532
+ showlegend=False,
533
+ xaxis=dict(title=kwargs.get('x_title', '')),
534
+ yaxis=dict(title=kwargs.get('y_title', '')),
535
+ width=kwargs.get('width', 800),
536
+ height=kwargs.get('height', 500)
537
+ )
538
+
539
+ return fig
540
+
541
+ def render_treemap(self, data: Dict[str, Any], **kwargs) -> go.Figure:
542
+ """Professional treemap visualization"""
543
+
544
+ title = kwargs.get('title', 'Structure Analysis')
545
+
546
+ # Prepare data for treemap
547
+ labels = data.get('labels', [])
548
+ values = data.get('values', [])
549
+ parents = data.get('parents', [])
550
+
551
+ if not labels or not values:
552
+ # Create placeholder treemap
553
+ labels = ['Content', 'Headers', 'Lists', 'Tables']
554
+ values = [100, 20, 15, 10]
555
+ parents = ['', 'Content', 'Content', 'Content']
556
+
557
+ fig = go.Figure(go.Treemap(
558
+ labels=labels,
559
+ values=values,
560
+ parents=parents,
561
+ textinfo="label+value+percent parent",
562
+ textfont=dict(size=12),
563
+ marker=dict(
564
+ colorscale='Viridis',
565
+ showscale=True
566
+ )
567
+ ))
568
+
569
+ fig.update_layout(
570
+ title=dict(
571
+ text=title,
572
+ x=0.5,
573
+ font=dict(size=16, color='#333333')
574
+ ),
575
+ template=self.theme,
576
+ width=kwargs.get('width', 800),
577
+ height=kwargs.get('height', 600)
578
+ )
579
+
580
+ return fig
581
+
582
+ def render_gauge_chart(self, value: float, **kwargs) -> go.Figure:
583
+ """Professional gauge chart for single metrics"""
584
+
585
+ title = kwargs.get('title', 'Quality Score')
586
+ max_value = kwargs.get('max_value', 10)
587
+
588
+ fig = go.Figure(go.Indicator(
589
+ mode="gauge+number+delta",
590
+ value=value,
591
+ domain={'x': [0, 1], 'y': [0, 1]},
592
+ title={'text': title, 'font': {'size': 16}},
593
+ delta={'reference': kwargs.get('reference', 7.0)},
594
+ gauge={
595
+ 'axis': {'range': [None, max_value], 'tickcolor': '#333333'},
596
+ 'bar': {'color': self.enterprise_colors['primary']},
597
+ 'steps': [
598
+ {'range': [0, max_value * 0.5], 'color': "lightgray"},
599
+ {'range': [max_value * 0.5, max_value * 0.8], 'color': "gray"}
600
+ ],
601
+ 'threshold': {
602
+ 'line': {'color': self.enterprise_colors['danger'], 'width': 4},
603
+ 'thickness': 0.75,
604
+ 'value': max_value * 0.9
605
+ }
606
+ }
607
+ ))
608
+
609
+ fig.update_layout(
610
+ template=self.theme,
611
+ width=kwargs.get('width', 400),
612
+ height=kwargs.get('height', 400)
613
+ )
614
+
615
+ return fig
616
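+
+ # Usage sketch (illustrative): render a standalone chart and export it with
+ # Plotly's standard HTML writer.
+ #
+ #   renderer = PlotlyChartRenderer()
+ #   fig = renderer.render_bar_chart(
+ #       {'Structural': 8.2, 'Content': 6.5, 'AI Analysis': 7.1},
+ #       title='Score Comparison', orientation='h'
+ #   )
+ #   fig.write_html('scores.html')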
+
617
+
618
+ class EnterpriseDashboardComposer:
619
+ """
620
+ Strategic dashboard composition with enterprise-grade layouts
621
+
622
+ Design Philosophy:
623
+ - Executive-Friendly Layouts: Information hierarchy for decision makers
624
+ - Responsive Design: Works across different screen sizes
625
+ - Performance Optimized: Efficient subplot generation
626
+ - Accessibility Compliant: Clear navigation and labeling
627
+ """
628
+
629
+ def __init__(self, chart_renderer: PlotlyChartRenderer):
630
+ self.chart_renderer = chart_renderer
631
+
632
+ def compose_quality_dashboard(
633
+ self,
634
+ quality_metrics: QualityAssessment,
635
+ structural_metrics: StructuralMetrics,
636
+ **kwargs
637
+ ) -> go.Figure:
638
+ """
639
+ Comprehensive quality dashboard with executive summary layout
640
+
641
+ Strategic Layout (2 rows x 3 columns):
642
+ - Top Row: Executive Summary (Quality Gauge, Score Radar, Structure Treemap)
643
+ - Bottom Row: Supporting Detail (Performance Bars, Element Breakdown, Summary Table)
645
+ """
646
+
647
+ # Create subplot layout with strategic positioning
648
+ fig = make_subplots(
649
+ rows=2, cols=3,
650
+ subplot_titles=(
651
+ 'Quality Overview', 'Detailed Scores', 'Document Structure',
652
+ 'Performance Metrics', 'Structural Elements', 'Analysis Summary'
653
+ ),
654
+ specs=[
655
+ [{"type": "indicator"}, {"type": "polar"}, {"type": "treemap"}],
656
+ [{"type": "bar"}, {"type": "bar"}, {"type": "table"}]
657
+ ],
658
+ vertical_spacing=0.12,
659
+ horizontal_spacing=0.08
660
+ )
661
+
662
+ # 1. Overall Quality Gauge (Executive Summary)
663
+ fig.add_trace(
664
+ go.Indicator(
665
+ mode="gauge+number+delta",
666
+ value=quality_metrics.composite_score,
667
+ domain={'x': [0, 1], 'y': [0, 1]},
668
+ title={'text': "Overall Quality Score"},
669
+ delta={'reference': 7.0},
670
+ gauge={
671
+ 'axis': {'range': [None, 10]},
672
+ 'bar': {'color': "#667eea"},
673
+ 'steps': [
674
+ {'range': [0, 5], 'color': "lightgray"},
675
+ {'range': [5, 8], 'color': "gray"}
676
+ ],
677
+ 'threshold': {
678
+ 'line': {'color': "red", 'width': 4},
679
+ 'thickness': 0.75,
680
+ 'value': 9
681
+ }
682
+ }
683
+ ),
684
+ row=1, col=1
685
+ )
686
+
687
+ # 2. Quality Breakdown Radar Chart
688
+ quality_data = {
689
+ 'Structural': quality_metrics.structural_score,
690
+ 'Content': quality_metrics.content_score,
691
+ 'AI Analysis': quality_metrics.ai_score,
692
+ 'Performance': quality_metrics.performance_score
693
+ }
694
+
695
+ fig.add_trace(
696
+ go.Scatterpolar(
697
+ r=list(quality_data.values()),
698
+ theta=list(quality_data.keys()),
699
+ fill='toself',
700
+ name='Quality Breakdown',
701
+ line=dict(color='#764ba2', width=2),
702
+ fillcolor="rgba(118, 75, 162, 0.3)"
703
+ ),
704
+ row=1, col=2
705
+ )
706
+
707
+ # 3. Document Structure Treemap
708
+ structure_data = self._prepare_structure_treemap_data(structural_metrics)
709
+ fig.add_trace(
710
+ go.Treemap(
711
+ labels=structure_data['labels'],
712
+ values=structure_data['values'],
713
+ parents=structure_data['parents'],
714
+ textinfo="label+value",
715
+ textfont=dict(size=10)
716
+ ),
717
+ row=1, col=3
718
+ )
719
+
720
+ # 4. Performance Metrics Bar Chart
721
+ perf_data = {
722
+ 'Processing Speed': quality_metrics.performance_score,
723
+ 'Structure Density': min(structural_metrics.structure_density * 10, 10),
724
+ 'Content Quality': quality_metrics.content_score
725
+ }
726
+
727
+ fig.add_trace(
728
+ go.Bar(
729
+ x=list(perf_data.keys()),
730
+ y=list(perf_data.values()),
731
+ marker=dict(color=['#28a745', '#17a2b8', '#ffc107']),
732
+ name='Performance Metrics'
733
+ ),
734
+ row=2, col=1
735
+ )
736
+
737
+ # 5. Structural Elements Breakdown
738
+ structure_breakdown = {
739
+ 'Headers': structural_metrics.header_count,
740
+ 'Lists': structural_metrics.list_items,
741
+ 'Tables': structural_metrics.table_rows,
742
+ 'Code Blocks': structural_metrics.code_blocks,
743
+ 'Links': structural_metrics.links
744
+ }
745
+
746
+ fig.add_trace(
747
+ go.Bar(
748
+ x=list(structure_breakdown.values()),
749
+ y=list(structure_breakdown.keys()),
750
+ orientation='h',
751
+ marker=dict(color='#667eea'),
752
+ name='Structural Elements'
753
+ ),
754
+ row=2, col=2
755
+ )
756
+
757
+ # 6. Analysis Summary Table
758
+ summary_data = [
759
+ ['Overall Score', f"{quality_metrics.composite_score:.1f}/10"],
760
+ ['Structure Elements', f"{sum(structure_breakdown.values())} items"],
761
+ ['Max Header Depth', f"{structural_metrics.max_header_depth} levels"],
762
+ ['Structure Density', f"{structural_metrics.structure_density:.1%}"]
763
+ ]
764
+
765
+ fig.add_trace(
766
+ go.Table(
767
+ header=dict(
768
+ values=['Metric', 'Value'],
769
+ fill_color='#667eea',
770
+ font=dict(color='white', size=12),
771
+ align='left'
772
+ ),
773
+ cells=dict(
774
+ values=list(zip(*summary_data)),
775
+ fill_color='#f8f9fa',
776
+ font=dict(color='#333333', size=11),
777
+ align='left'
778
+ )
779
+ ),
780
+ row=2, col=3
781
+ )
782
+
783
+ # Update layout with enterprise styling
784
+ fig.update_layout(
785
+ title=dict(
786
+ text="Document Conversion Quality Dashboard",
787
+ x=0.5,
788
+ font=dict(size=20, color='#333333')
789
+ ),
790
+ template='plotly_white',
791
+ height=kwargs.get('height', 800),
792
+ showlegend=False,
793
+ margin=dict(t=100, b=50, l=50, r=50)
794
+ )
795
+
796
+ # Update polar chart layout
797
+ fig.update_polars(
798
+ radialaxis=dict(
799
+ visible=True,
800
+ range=[0, 10],
801
+ tickfont=dict(size=10)
802
+ )
803
+ )
804
+
805
+ return fig
806
+
807
+ def _prepare_structure_treemap_data(self, metrics: StructuralMetrics) -> Dict[str, List]:
808
+ """Prepare data for structure treemap visualization"""
809
+
810
+ total_elements = (
811
+ metrics.header_count + metrics.list_items +
812
+ metrics.table_rows + metrics.code_blocks + metrics.links
813
+ )
814
+
815
+ if total_elements == 0:
816
+ return {
817
+ 'labels': ['Document', 'Content'],
818
+ 'values': [100, 100],
819
+ 'parents': ['', 'Document']
820
+ }
821
+
822
+ return {
823
+ 'labels': [
824
+ 'Document', 'Headers', 'Lists', 'Tables', 'Code Blocks', 'Links'
825
+ ],
826
+ 'values': [
827
+ total_elements,
828
+ max(metrics.header_count, 1),
829
+ max(metrics.list_items, 1),
830
+ max(metrics.table_rows, 1),
831
+ max(metrics.code_blocks, 1),
832
+ max(metrics.links, 1)
833
+ ],
834
+ 'parents': [
835
+ '', 'Document', 'Document', 'Document', 'Document', 'Document'
836
+ ]
837
+ }
838
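+
+ # Note: with Plotly's default branchvalues='remainder', the 'Document' value is
+ # treated as weight in addition to its children, so flooring empty categories
+ # at 1 keeps every branch visible; branchvalues='total' would instead require
+ # the children to sum to no more than the parent.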
+
839
+
840
+ # ==================== FACADE ORCHESTRATOR ====================
841
+
842
+ class VisualizationOrchestrator:
843
+ """
844
+ Strategic orchestration layer - coordinates visualization components
845
+
846
+ Design Philosophy:
847
+ - Facade Pattern: Simple interface hiding complex component interactions
848
+ - Dependency Injection: All components provided at construction
849
+ - Error Boundary: Comprehensive error handling with graceful degradation
850
+ - Performance Monitoring: Built-in metrics and optimization
851
+ """
852
+
853
+ def __init__(
854
+ self,
855
+ content_analyzer: Optional[ContentAnalyzer] = None,
856
+ chart_renderer: Optional[ChartRenderer] = None,
857
+ dashboard_composer: Optional[DashboardComposer] = None
858
+ ):
859
+ # Use default implementations if not provided
860
+ self.content_analyzer = content_analyzer or OptimizedContentAnalyzer()
861
+ self.chart_renderer = chart_renderer or PlotlyChartRenderer()
862
+ self.dashboard_composer = dashboard_composer or EnterpriseDashboardComposer(
863
+ self.chart_renderer
864
+ )
865
+
866
+ # Performance metrics
867
+ self.visualization_count = 0
868
+ self.error_count = 0
869
+ self.total_processing_time = 0.0
870
+
871
+ def create_quality_dashboard(self, conversion_result, analysis_result=None) -> go.Figure:
872
+ """
873
+ Primary interface for quality dashboard generation
874
+
875
+ Strategic Approach:
876
+ - Input Validation: Comprehensive parameter checking
877
+ - Data Transformation: Convert external formats to internal abstractions
878
+ - Component Coordination: Orchestrate analysis and visualization
879
+ - Error Recovery: Graceful degradation for failed components
880
+ """
881
+
882
+ start_time = datetime.now()
883
+ self.visualization_count += 1
884
+
885
+ try:
886
+ # Convert external data to internal abstraction
887
+ analysis_data = DocumentAnalysisData.from_processing_result(
888
+ conversion_result, analysis_result
889
+ )
890
+
891
+ # Generate quality assessment
892
+ quality_metrics = self.content_analyzer.calculate_quality_metrics(analysis_data)
893
+
894
+ # Analyze document structure
895
+ structural_metrics = self.content_analyzer.analyze_structure(analysis_data.content)
896
+
897
+ # Create comprehensive dashboard
898
+ dashboard = self.dashboard_composer.compose_quality_dashboard(
899
+ quality_metrics, structural_metrics
900
+ )
901
+
902
+ # Track performance
903
+ processing_time = (datetime.now() - start_time).total_seconds()
904
+ self.total_processing_time += processing_time
905
+
906
+ logger.info(f"Quality dashboard generated in {processing_time:.2f}s")
907
+ return dashboard
908
+
909
+ except Exception as e:
910
+ self.error_count += 1
911
+ logger.error(f"Quality dashboard generation failed: {str(e)}")
912
+
913
+ # Return fallback visualization
914
+ return self._create_error_fallback_dashboard(str(e))
915
+
916
+ def create_structural_analysis_viz(self, conversion_result) -> go.Figure:
917
+ """Generate detailed structural analysis visualization"""
918
+
919
+ try:
920
+ analysis_data = DocumentAnalysisData.from_processing_result(conversion_result)
921
+ structural_metrics = self.content_analyzer.analyze_structure(analysis_data.content)
922
+
923
+ # Create detailed structural visualization
924
+ return self._create_structure_analysis_dashboard(structural_metrics)
925
+
926
+ except Exception as e:
927
+ logger.error(f"Structural analysis visualization failed: {str(e)}")
928
+ return self._create_error_fallback_dashboard(str(e))
929
+
930
+ def create_export_ready_report(self, conversion_result, analysis_result=None) -> Dict[str, go.Figure]:
931
+ """Generate comprehensive export-ready report with multiple visualizations"""
932
+
933
+ try:
934
+ analysis_data = DocumentAnalysisData.from_processing_result(
935
+ conversion_result, analysis_result
936
+ )
937
+
938
+ quality_metrics = self.content_analyzer.calculate_quality_metrics(analysis_data)
939
+ structural_metrics = self.content_analyzer.analyze_structure(analysis_data.content)
940
+
941
+ # Generate multiple visualization components
942
+ report_figures = {
943
+ 'executive_dashboard': self.dashboard_composer.compose_quality_dashboard(
944
+ quality_metrics, structural_metrics
945
+ ),
946
+ 'quality_breakdown': self.chart_renderer.render_radar_chart(
947
+ quality_metrics.to_dict(),
948
+ title="Quality Assessment Breakdown"
949
+ ),
950
+ 'structural_analysis': self._create_structure_analysis_dashboard(structural_metrics),
951
+ 'performance_summary': self.chart_renderer.render_gauge_chart(
952
+ quality_metrics.composite_score,
953
+ title="Overall Quality Score"
954
+ )
955
+ }
956
+
957
+ logger.info(f"Export report generated with {len(report_figures)} visualizations")
958
+ return report_figures
959
+
960
+ except Exception as e:
961
+ logger.error(f"Export report generation failed: {str(e)}")
962
+ return {
963
+ 'error_report': self._create_error_fallback_dashboard(str(e))
964
+ }
965
+
966
+ def _create_structure_analysis_dashboard(self, structural_metrics: StructuralMetrics) -> go.Figure:
967
+ """Create detailed structural analysis dashboard"""
968
+
969
+ # Create multi-panel structural analysis
970
+ fig = make_subplots(
971
+ rows=2, cols=2,
972
+ subplot_titles=(
973
+ 'Element Distribution', 'Structure Hierarchy',
974
+ 'Content Density', 'Quality Assessment'
975
+ ),
976
+ specs=[
977
+ [{"type": "pie"}, {"type": "bar"}],
978
+ [{"type": "scatter"}, {"type": "indicator"}]
979
+ ]
980
+ )
981
+
982
+ # 1. Element Distribution Pie Chart
983
+ elements = {
984
+ 'Headers': structural_metrics.header_count,
985
+ 'Lists': structural_metrics.list_items,
986
+ 'Tables': structural_metrics.table_rows,
987
+ 'Code': structural_metrics.code_blocks,
988
+ 'Links': structural_metrics.links
989
+ }
990
+
991
+ # Filter out zero values for cleaner visualization
992
+ non_zero_elements = {k: v for k, v in elements.items() if v > 0}
993
+
994
+ if non_zero_elements:
995
+ fig.add_trace(
996
+ go.Pie(
997
+ labels=list(non_zero_elements.keys()),
998
+ values=list(non_zero_elements.values()),
999
+ hole=0.3,
1000
+ marker=dict(colors=self.chart_renderer.color_palette[:len(non_zero_elements)])
1001
+ ),
1002
+ row=1, col=1
1003
+ )
1004
+
1005
+ # 2. Structure Hierarchy Bar Chart
1006
+ hierarchy_data = {
1007
+ 'Max Depth': structural_metrics.max_header_depth,
1008
+ 'Total Elements': sum(elements.values()),
1009
+ 'Structure Score': min(structural_metrics.structure_density * 10, 10)
1010
+ }
1011
+
1012
+ fig.add_trace(
1013
+ go.Bar(
1014
+ x=list(hierarchy_data.keys()),
1015
+ y=list(hierarchy_data.values()),
1016
+ marker=dict(color='#667eea'),
1017
+ name='Structure Metrics'
1018
+ ),
1019
+ row=1, col=2
1020
+ )
1021
+
1022
+ # 3. Content Density Analysis
1023
+ fig.add_trace(
1024
+ go.Scatter(
1025
+ x=['Structure Density'],
1026
+ y=[structural_metrics.structure_density],
1027
+ mode='markers',
1028
+ marker=dict(
1029
+ size=30,
1030
+ color=structural_metrics.structure_density,
1031
+ colorscale='Viridis',
1032
+ showscale=True
1033
+ ),
1034
+ name='Density Score'
1035
+ ),
1036
+ row=2, col=1
1037
+ )
1038
+
1039
+ # 4. Structure Quality Indicator
1040
+ structure_quality = min(structural_metrics.structure_density * 10, 10)
1041
+
1042
+ fig.add_trace(
1043
+ go.Indicator(
1044
+ mode="gauge+number",
1045
+ value=structure_quality,
1046
+ domain={'x': [0, 1], 'y': [0, 1]},
1047
+ title={'text': "Structure Quality"},
1048
+ gauge={
1049
+ 'axis': {'range': [None, 10]},
1050
+ 'bar': {'color': "#28a745"},
1051
+ 'steps': [
1052
+ {'range': [0, 5], 'color': "lightgray"},
1053
+ {'range': [5, 8], 'color': "gray"}
1054
+ ]
1055
+ }
1056
+ ),
1057
+ row=2, col=2
1058
+ )
1059
+
1060
+ fig.update_layout(
1061
+ title="Document Structure Analysis",
1062
+ height=700,
1063
+ showlegend=True,
1064
+ template='plotly_white'
1065
+ )
1066
+
1067
+ return fig
1068
+
1069
+ def _create_error_fallback_dashboard(self, error_message: str) -> go.Figure:
1070
+ """Create fallback visualization for error scenarios"""
1071
+
1072
+ fig = go.Figure()
1073
+
1074
+ fig.add_annotation(
1075
+ x=0.5, y=0.5,
1076
+ xref="paper", yref="paper",
1077
+ text=f"Visualization Error<br>{error_message[:100]}{'...' if len(error_message) > 100 else ''}",
1078
+ showarrow=False,
1079
+ font=dict(size=16, color="red"),
1080
+ bgcolor="rgba(255, 0, 0, 0.1)",
1081
+ bordercolor="red",
1082
+ borderwidth=2
1083
+ )
1084
+
1085
+ fig.update_layout(
1086
+ title="Visualization Generation Error",
1087
+ height=400,
1088
+ template='plotly_white'
1089
+ )
1090
+
1091
+ return fig
1092
+
1093
+ def get_performance_metrics(self) -> JSONDict:
1094
+ """Get comprehensive performance metrics for monitoring"""
1095
+
1096
+ avg_processing_time = (
1097
+ self.total_processing_time / self.visualization_count
1098
+ if self.visualization_count > 0 else 0
1099
+ )
1100
+
1101
+ success_rate = (
1102
+ ((self.visualization_count - self.error_count) / self.visualization_count * 100)
1103
+ if self.visualization_count > 0 else 0
1104
+ )
1105
+
1106
+ # Get content analyzer cache statistics if available
1107
+ cache_stats = {}
1108
+ if hasattr(self.content_analyzer, 'get_cache_statistics'):
1109
+ cache_stats = self.content_analyzer.get_cache_statistics()
1110
+
1111
+ return {
1112
+ 'visualizations_generated': self.visualization_count,
1113
+ 'error_count': self.error_count,
1114
+ 'success_rate_percent': success_rate,
1115
+ 'average_processing_time': avg_processing_time,
1116
+ 'total_processing_time': self.total_processing_time,
1117
+ 'cache_statistics': cache_stats,
1118
+ 'status': 'healthy' if success_rate > 90 else 'degraded' if success_rate > 70 else 'unhealthy'
1119
+ }
1120
+
1121
+
1122
+ # ==================== BACKWARDS COMPATIBILITY LAYER ====================
1123
+
1124
+ class InteractiveVisualizationEngine:
1125
+ """
1126
+ Backwards compatibility facade for existing code
1127
+
1128
+ Strategic Purpose:
1129
+ - Maintains existing API for legacy integration
1130
+ - Delegates to new architecture components
1131
+ - Provides migration path to new patterns
1132
+ - Zero breaking changes for existing consumers
1133
+ """
1134
+
1135
+ def __init__(self, config=None):
1136
+ # Initialize new architecture components
1137
+ self.orchestrator = VisualizationOrchestrator()
1138
+ self.config = config or {}
1139
+
1140
+ logger.info("InteractiveVisualizationEngine initialized with new architecture")
1141
+
1142
+ def create_quality_dashboard(self, conversion_result, analysis_result=None):
1143
+ """Legacy API compatibility method"""
1144
+ return self.orchestrator.create_quality_dashboard(conversion_result, analysis_result)
1145
+
1146
+ def create_structural_analysis_viz(self, conversion_result):
1147
+ """Legacy API compatibility method"""
1148
+ return self.orchestrator.create_structural_analysis_viz(conversion_result)
1149
+
1150
+ def create_export_ready_report(self, conversion_result, analysis_result=None):
1151
+ """Legacy API compatibility method"""
1152
+ return self.orchestrator.create_export_ready_report(conversion_result, analysis_result)
1153
+
1154
+ def create_comparison_analysis(self, results):
1155
+ """Placeholder for comparison analysis - future implementation"""
1156
+ logger.warning("Comparison analysis not yet implemented in refactored architecture")
1157
+
1158
+ # Return placeholder visualization
1159
+ fig = go.Figure()
1160
+ fig.add_annotation(
1161
+ x=0.5, y=0.5,
1162
+ xref="paper", yref="paper",
1163
+ text="Comparison Analysis<br/>Coming Soon in Next Release",
1164
+ showarrow=False,
1165
+ font=dict(size=16, color="gray")
1166
+ )
1167
+ fig.update_layout(title="Feature Under Development", height=400)
1168
+
1169
+ return fig, pd.DataFrame()
1170
+
1171
+
1172
+ class QualityMetricsCalculator:
1173
+ """
1174
+ Backwards compatibility wrapper for quality metrics calculation
1175
+
1176
+ Delegates to new OptimizedContentAnalyzer while maintaining existing interface
1177
+ """
1178
+
1179
+ def __init__(self):
1180
+ self.analyzer = OptimizedContentAnalyzer()
1181
+ logger.info("QualityMetricsCalculator initialized with optimized backend")
1182
+
1183
+ @staticmethod
1184
+ def calculate_conversion_quality_metrics(conversion_result, analysis_result=None):
1185
+ """Legacy API method - delegates to new architecture"""
1186
+
1187
+ # Create analyzer instance for static method compatibility
1188
+ analyzer = OptimizedContentAnalyzer()
1189
+
1190
+ # Convert to new data format
1191
+ analysis_data = DocumentAnalysisData.from_processing_result(
1192
+ conversion_result, analysis_result
1193
+ )
1194
+
1195
+ # Calculate quality assessment using new system
1196
+ quality_assessment = analyzer.calculate_quality_metrics(analysis_data)
1197
+ structural_metrics = analyzer.analyze_structure(analysis_data.content)
1198
+
1199
+ # Convert to legacy format for backwards compatibility
1200
+ return {
1201
+ 'composite_score': quality_assessment.composite_score,
1202
+ 'basic_metrics': {
1203
+ 'total_words': len(analysis_data.content.split()) if analysis_data.content else 0,
1204
+ 'total_lines': len(analysis_data.content.split('\n')) if analysis_data.content else 0,
1205
+ 'total_characters': len(analysis_data.content)
1206
+ },
1207
+ 'structural_metrics': structural_metrics.to_dict(),
1208
+ 'content_metrics': {
1209
+ 'information_density': structural_metrics.structure_density
1210
+ },
1211
+ 'performance_metrics': {
1212
+ 'processing_time_seconds': analysis_data.processing_metrics.get('processing_time', 0),
1213
+ 'efficiency_score': quality_assessment.performance_score
1214
+ },
1215
+ 'ai_analysis_metrics': {
1216
+ 'overall_ai_score': quality_assessment.ai_score,
1217
+ 'analysis_available': analysis_data.ai_analysis_data is not None
1218
+ }
1219
+ }
1220
+
1221
+
1222
+ # ==================== CONFIGURATION CLASSES ====================
1223
+
1224
+ @dataclass
1225
+ class VisualizationConfig:
1226
+ """Configuration container for visualization settings"""
1227
+
1228
+ class VisualizationTheme(Enum):
1229
+ CORPORATE = "plotly_white"
1230
+ DARK_MODERN = "plotly_dark"
1231
+ MINIMAL = "simple_white"
1232
+ PRESENTATION = "presentation"
1233
+
1234
+ theme: VisualizationTheme = VisualizationTheme.CORPORATE
1235
+ width: int = 800
1236
+ height: int = 600
1237
+ show_legend: bool = True
1238
+ interactive: bool = True
1239
+ export_format: str = "html"
1240
+ color_palette: List[str] = field(default_factory=lambda: [
1241
+ '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
1242
+ '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf'
1243
+ ])
1244
+
1245
+
1246
+ class ReportGenerator:
1247
+ """
1248
+ Enterprise report generation with multiple output formats
1249
+
1250
+ Backwards compatibility wrapper around new architecture
1251
+ """
1252
+
1253
+ def __init__(self, viz_engine):
1254
+ if isinstance(viz_engine, InteractiveVisualizationEngine):
1255
+ self.orchestrator = viz_engine.orchestrator
1256
+ else:
1257
+ # Fallback for direct orchestrator usage
1258
+ self.orchestrator = viz_engine
1259
+
1260
+ def generate_executive_report(self, conversion_result, analysis_result=None, export_format="html"):
1261
+ """Generate comprehensive executive report with new architecture"""
1262
+
1263
+ try:
1264
+ # Generate visualizations using new system
1265
+ report_figures = self.orchestrator.create_export_ready_report(
1266
+ conversion_result, analysis_result
1267
+ )
1268
+
1269
+ # Calculate metrics using new analyzer
1270
+ analysis_data = DocumentAnalysisData.from_processing_result(
1271
+ conversion_result, analysis_result
1272
+ )
1273
+ analyzer = OptimizedContentAnalyzer()
1274
+ quality_metrics = analyzer.calculate_quality_metrics(analysis_data)
1275
+
1276
+ # Generate executive summary
1277
+ executive_summary = self._generate_executive_summary(quality_metrics, analysis_result)
1278
+
1279
+ return {
1280
+ 'metadata': {
1281
+ 'generated_at': datetime.now().isoformat(),
1282
+ 'document_name': analysis_data.metadata.get('original_file', {}).get('filename', 'Unknown'),
1283
+ 'overall_score': quality_metrics.composite_score
1284
+ },
1285
+ 'executive_summary': executive_summary,
1286
+ 'visualizations': report_figures,
1287
+ 'quality_metrics': quality_metrics.to_dict(),
1288
+ 'export_format': export_format
1289
+ }
1290
+
1291
+ except Exception as e:
1292
+ logger.error(f"Executive report generation failed: {str(e)}")
1293
+ return {
1294
+ 'metadata': {'generated_at': datetime.now().isoformat(), 'error': str(e)},
1295
+ 'executive_summary': {'error': 'Report generation failed'},
1296
+ 'visualizations': {},
1297
+ 'quality_metrics': {},
1298
+ 'export_format': export_format
1299
+ }
1300
+
1301
+ def _generate_executive_summary(self, quality_metrics: QualityAssessment, analysis_result):
1302
+ """Generate executive summary with business-friendly language"""
1303
+
1304
+ score = quality_metrics.composite_score
1305
+
1306
+ if score >= 8:
1307
+ quality_assessment = "Excellent"
1308
+ recommendation = "Document conversion achieved outstanding quality. Ready for production deployment."
1309
+ elif score >= 6:
1310
+ quality_assessment = "Good"
1311
+ recommendation = "Document conversion quality is good with minor optimization opportunities."
1312
+ elif score >= 4:
1313
+ quality_assessment = "Acceptable"
1314
+ recommendation = "Document conversion quality is acceptable. Consider improvements for enhanced results."
1315
+ else:
1316
+ quality_assessment = "Needs Improvement"
1317
+ recommendation = "Document conversion quality requires attention. Review source document and processing settings."
1318
+
1319
+ key_insights = []
1320
+ if quality_metrics.structural_score > 7:
1321
+ key_insights.append("Strong document structure with well-organized content hierarchy.")
1322
+ if quality_metrics.ai_score > 7:
1323
+ key_insights.append("AI analysis confirms high-quality content extraction and processing.")
1324
+ if quality_metrics.performance_score > 7:
1325
+ key_insights.append("Efficient processing with optimal resource utilization.")
1326
+
1327
+ return {
1328
+ 'quality_assessment': quality_assessment,
1329
+ 'overall_score': f"{score:.1f}/10",
1330
+ 'recommendation': recommendation,
1331
+ 'key_insights': key_insights,
1332
+ 'executive_summary': f"""
1333
+ Document conversion analysis completed with an overall quality score of {score:.1f}/10,
1334
+ rated as {quality_assessment}. {recommendation}
1335
+
1336
+ Key performance indicators show {len(key_insights)} positive quality factors identified
1337
+ during comprehensive analysis.
1338
+ """
1339
+ }
1340
+
1341
+
1342
+ # ==================== PUBLIC API EXPORTS ====================
1343
+
1344
+ __all__ = [
1345
+ # Core abstractions
1346
+ 'DocumentAnalysisData',
1347
+ 'StructuralMetrics',
1348
+ 'QualityAssessment',
1349
+ 'VisualizationRequest',
1350
+
1351
+ # Primary components
1352
+ 'OptimizedContentAnalyzer',
1353
+ 'PlotlyChartRenderer',
1354
+ 'EnterpriseDashboardComposer',
1355
+ 'VisualizationOrchestrator',
1356
+
1357
+ # Backwards compatibility
1358
+ 'InteractiveVisualizationEngine',
1359
+ 'QualityMetricsCalculator',
1360
+ 'ReportGenerator',
1361
+
1363
+ # Configuration
1364
+ 'VisualizationConfig'
1365
+ ]
1366
+
1367
+
1368
+ # ==================== MODULE INITIALIZATION ====================
1369
+
1370
+ if __name__ == "__main__":
1371
+ # Module self-test and performance benchmarking
1372
+ logger.info("MarkItDown Visualization Engine - Architecture Validation")
1373
+
1374
+ # Test component initialization
1375
+ try:
1376
+ analyzer = OptimizedContentAnalyzer()
1377
+ renderer = PlotlyChartRenderer()
1378
+ composer = EnterpriseDashboardComposer(renderer)
1379
+ orchestrator = VisualizationOrchestrator(analyzer, renderer, composer)
1380
+
1381
+ logger.info("✅ All components initialized successfully")
1382
+
1383
+ # Test backwards compatibility
1384
+ legacy_engine = InteractiveVisualizationEngine()
1385
+ legacy_calculator = QualityMetricsCalculator()
1386
+
1387
+ logger.info("✅ Backwards compatibility layer functional")
1388
+ logger.info("🚀 Visualization engine ready for production deployment")
1389
+
1390
+ except Exception as e:
1391
+ logger.error(f"❌ Component initialization failed: {str(e)}")
1392
+ raise
1393
+ # NOTE: the JSONDict alias belongs with the module's other type aliases near the
+ # imports; unless `from __future__ import annotations` is in effect, defining it
+ # here, after the annotations that reference it, raises NameError at import time.
+ JSONDict = Dict[str, JsonValue]