---
title: SWE-Bench Verified Discriminative Subsets Leaderboard
emoji: 🏆
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: mit
tags:
  - leaderboard
  - software-engineering
  - swe-bench
  - evaluation
  - benchmark
---

# 🏆 SWE-Bench Verified Discriminative Subsets Leaderboard

This interactive leaderboard displays SWE-agent performance across SWE-bench Verified and four discriminative subsets designed to provide enhanced evaluation sensitivity for state-of-the-art systems.

## 🎯 Why Discriminative Subsets?

As SWE-agents improve and reach 70%+ success rates on the full SWE-bench Verified benchmark, traditional evaluation loses discriminative power. These targeted subsets focus on the most challenging problems to better distinguish between top-tier systems.

## 📊 The Four Discriminative Subsets

1. **🔥 Frontier Subset** (95 instances): Problems solved by ≤5 agents - maximum evaluative sensitivity
   - Combines unsolved, ultra-rare, and very-rare problems
   - Top agent: 11.6% vs 73.2% on the full benchmark (6x better discrimination)

2. **⚡ Challenging Subset** (155 instances): Problems solved by ≤20 agents - strong evaluative power
   - Balances discrimination with statistical significance
   - Includes frontier + rare and uncommon problems

3. **💪 Hard Subset** (45 instances): All Hard-difficulty problems regardless of solve rate
   - Traditional difficulty-based evaluation
   - Focuses on problems originally classified as most difficult

4. **📁 MultiFile Subset** (40 instances): Multi-file problems solved by ≤10 agents
   - Targets real-world complexity requiring coordinated edits
   - Even leading agents achieve only a 10% success rate

## 🔬 Methodology

Subsets were created through systematic analysis of the solve distribution across 83 evaluated SWE-agents:

- Problems solved by fewer agents provide better discrimination
- Analysis covers submissions from October 2023 to May 2025
- "Solved" means the agent's fix passed the verification test suite

## 📈 Key Insights

- **Enhanced Resolution**: The Frontier subset provides 6x better discrimination between top systems
- **Multi-file Complexity**: Multi-file problems represent genuine software engineering challenges
- **Statistical Significance**: The Challenging subset offers robust evaluation with strong discrimination
- **Real Progress**: Performance on these subsets indicates genuine capability advances

## 🔗 Resources

- **Blog Post**: [From 73% to 11%: Revealing True SWE-Agent Capabilities](https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets.html)
- **Dataset**: [SWE-bench_Verified-discriminative](https://huggingface.co/datasets/jatinganhotra/SWE-bench_Verified-discriminative)
- **Original SWE-Bench**: [SWE-bench.com](https://www.swebench.com/)

## 🚀 Usage

```python
from datasets import load_dataset

# Load specific discriminative subsets
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
challenging = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="challenging")
```

A short sketch of scoring an agent run against one of these subsets appears at the end of this README.

Created by [Jatin Ganhotra](https://jatinganhotra.dev) | Last Updated: June 19, 2025
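
## 🧮 Example: Scoring a Run Against a Subset

As a minimal sketch, the snippet below computes a subset-level resolved rate from the set of instance IDs an agent resolved. It assumes the subset rows follow standard SWE-bench conventions and expose an `instance_id` field; the `resolved_ids` set and the IDs in it are placeholders to be replaced with your own run results.

```python
from datasets import load_dataset

# Placeholder: instance IDs your agent resolved, e.g. parsed from your run's report.
resolved_ids = {"django__django-11099", "sympy__sympy-13877"}

# Assumes each row exposes the standard SWE-bench `instance_id` field.
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")

solved = sum(1 for row in frontier if row["instance_id"] in resolved_ids)
print(f"Frontier resolved rate: {solved}/{len(frontier)} = {solved / len(frontier):.1%}")
```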