---
title: SWE-Bench Verified Discriminative Subsets Leaderboard
emoji: 🏆
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: mit
tags:
- leaderboard
- software-engineering
- swe-bench
- evaluation
- benchmark
---
# 🏆 SWE-Bench Verified Discriminative Subsets Leaderboard
This interactive leaderboard displays SWE-agent performance on SWE-Bench Verified and on four discriminative subsets designed to provide greater evaluation sensitivity for state-of-the-art systems.
## 🎯 Why Discriminative Subsets?
As SWE-agents improve, with top systems now achieving 70%+ success rates on the full SWE-Bench Verified benchmark, traditional evaluation loses discriminative power. These targeted subsets focus on the most challenging problems to better distinguish between top-tier systems.
## 📊 The Four Discriminative Subsets
1. **🔥 Frontier Subset** (95 instances): Problems solved by ≤5 agents - maximum evaluative sensitivity
- Combines unsolved, ultra-rare, and very-rare problems
- Top agent: 11.6% vs 73.2% on full benchmark (6x better discrimination)
2. **⚡ Challenging Subset** (155 instances): Problems solved by ≤20 agents - strong evaluative power
- Balances discrimination with statistical significance
- Includes frontier + rare and uncommon problems
3. **💪 Hard Subset** (45 instances): All Hard difficulty problems regardless of solve rate
- Traditional difficulty-based evaluation
- Focuses on problems originally classified as most difficult
4. **📁 MultiFile Subset** (40 instances): Multi-file problems solved by ≤10 agents
- Targets real-world complexity requiring coordinated edits
   - Even leading agents achieve only a 10% success rate
## 🔬 Methodology
Subsets were created through systematic analysis of solve distribution across 83 evaluated SWE-agents:
- Problems solved by fewer agents provide better discrimination
- Analysis covers submissions from October 2023 to May 2025
- "Solved" means the agent's fix passed the verification test suite
## 📈 Key Insights
- **Enhanced Resolution**: Frontier subset provides 6x better discrimination between top systems
- **Multi-file Complexity**: Represents genuine software engineering challenges
- **Statistical Significance**: Challenging subset offers robust evaluation with strong discrimination
- **Real Progress**: Performance on these subsets indicates genuine capability advances
## 🔗 Resources
- **Blog Post**: [From 73% to 11%: Revealing True SWE-Agent Capabilities](https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets.html)
- **Dataset**: [SWE-bench_Verified-discriminative](https://huggingface.co/datasets/jatinganhotra/SWE-bench_Verified-discriminative)
- **Original SWE-Bench**: [SWE-bench.com](https://www.swebench.com/)
## 🚀 Usage
```python
from datasets import load_dataset
# Load specific discriminative subset
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
challenging = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="challenging")
```
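To score an agent against one of these subsets, you can intersect the agent's resolved instance IDs with the subset's IDs. The helper below is a hedged sketch: `resolved_ids` and the example IDs are hypothetical inputs, not part of the dataset.

```python
def subset_success_rate(resolved_ids, subset_instance_ids):
    """Fraction of a subset's instances that the agent resolved."""
    subset = set(subset_instance_ids)
    if not subset:
        return 0.0
    return len(subset & set(resolved_ids)) / len(subset)

# Example with hypothetical instance IDs:
rate = subset_success_rate(
    resolved_ids={"django-042"},
    subset_instance_ids=["astropy-001", "django-042", "flask-007", "sympy-013"],
)
print(f"{rate:.1%}")  # 25.0%
```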
Created by [Jatin Ganhotra](https://jatinganhotra.dev) | Last Updated: June 19, 2025