---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

# Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's Jobs API.

## Setup

### Prerequisites

- Python 3.8+
- Hugging Face account with API token
- Access to the `IPTesting` namespace on Hugging Face

### Installation

1. Clone or navigate to this repository:

   ```bash
   cd InferenceProviderTestingBackend
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up your Hugging Face token as an environment variable:

   ```bash
   export HF_TOKEN="your_huggingface_token_here"
   ```

**Important**: Your `HF_TOKEN` must have:

- Permission to call inference providers
- Write access to the `IPTesting` organization

## Usage

### Starting the Dashboard

Run the Gradio app:

```bash
python app.py
```

### Initialize Models and Providers

1. Click the **"Fetch and Initialize Models/Providers"** button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

2. Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:

   ```
   meta-llama/Llama-3.2-3B-Instruct fireworks-ai
   meta-llama/Llama-3.2-3B-Instruct together-ai
   Qwen/Qwen2.5-7B-Instruct fireworks-ai
   mistralai/Mistral-7B-Instruct-v0.3 together-ai
   ```

Format: `model_name provider_name` (separated by spaces or tabs).

### Launching Jobs

1. Enter the evaluation tasks in the **Tasks** field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click **"Launch Jobs"**

The system will:

- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically

### Monitoring Jobs

The **Job Results** table displays all jobs with:

- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The most recent job id; inspect it at `https://huggingface.co/jobs/NAMESPACE/JOBID`

The table auto-refreshes every 30 seconds, or you can click **"Refresh Results"** for a manual update.

## Configuration

### Tasks Format

The tasks parameter follows the lighteval task format. Example:

- `lighteval|mmlu|0` - MMLU benchmark

### Daily Checkpoint

The system automatically saves all results to the HuggingFace dataset at **00:00 (midnight)** every day.
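The checkpoint is driven by the cron-based APScheduler setup described under Architecture below. As a rough illustration (not the actual implementation), a midnight save could be wired up like this, assuming a hypothetical `save_results_to_dataset()` helper:

```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

def save_results_to_dataset():
    # Hypothetical helper: push the in-memory job results to the
    # IPTesting/inference-provider-test-results dataset.
    ...

# Run the checkpoint every day at 00:00, in the background alongside the Gradio app.
scheduler = BackgroundScheduler()
scheduler.add_job(save_results_to_dataset, CronTrigger(hour=0, minute=0))
scheduler.start()
```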
### Data Persistence

All job results are stored in a HuggingFace dataset (`IPTesting/inference-provider-test-results`), which means:

- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the HF datasets library (see the example at the end of this README)

## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
- **Thread-safe**: Uses locks to prevent concurrent-access issues when reading or updating `job_results`
- **HuggingFace Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset

## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access
- Review logs for specific error messages

### Scores Not Appearing

- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores
- Example table format:

  ```
  | Task                     | Version | Metric                  | Value  | Stderr |
  | extended:ifeval:0        |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
  | lighteval:gpqa:diamond:0 |         | gpqa_pass@k_with_k      | 0.5000 | 0.0503 |
  ```

- If scores don't appear, check the console output for extraction or parsing errors

## Files

- [app.py](app.py) - Main Gradio application with UI and job management
- [utils/](utils/) - Utility package with helper modules:
  - [utils/io.py](utils/io.py) - I/O operations: model/provider fetching, file operations, dataset persistence
  - [utils/jobs.py](utils/jobs.py) - Job management: launching, monitoring, score extraction
- [models_providers.txt](models_providers.txt) - Configuration file with model-provider combinations
- [requirements.txt](requirements.txt) - Python dependencies
- [README.md](README.md) - This file
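As mentioned under Data Persistence, the stored results can also be read programmatically. A minimal sketch using the `datasets` library (the split and column names depend on the dataset's actual schema, so inspect it first):

```python
from datasets import load_dataset

# Requires an HF token with read access to the IPTesting organization.
ds = load_dataset("IPTesting/inference-provider-test-results")

# Print the DatasetDict to see the available splits and columns before querying.
print(ds)
```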