---
title: SWE-Bench Verified Discriminative Subsets Leaderboard
emoji: π
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: mit
tags:
- leaderboard
- software-engineering
- swe-bench
- evaluation
- benchmark
---

# SWE-Bench Verified Discriminative Subsets Leaderboard

This interactive leaderboard shows how SWE-agents perform on the full SWE-Bench Verified benchmark and on four discriminative subsets designed to give sharper evaluation sensitivity for state-of-the-art systems.

## 🎯 Why Discriminative Subsets?

As SWE-agents reach success rates above 70% on the full SWE-Bench Verified benchmark, the aggregate score loses discriminative power. These targeted subsets focus on the most challenging problems to better distinguish between top-tier systems.

## The Four Discriminative Subsets

1. **🔥 Frontier Subset** (95 instances): Problems solved by ≤5 agents, for maximum evaluative sensitivity
   - Combines unsolved, ultra-rare, and very-rare problems
   - Top agent: 11.6% vs 73.2% on the full benchmark (about 6x lower, leaving far more room to separate top systems)

2. **⚡ Challenging Subset** (155 instances): Problems solved by ≤20 agents, for strong evaluative power
   - Balances discrimination with statistical significance
   - Includes the Frontier problems plus rare and uncommon problems

3. **💪 Hard Subset** (45 instances): All problems of Hard difficulty, regardless of solve rate
   - Traditional difficulty-based evaluation
   - Focuses on problems originally classified as most difficult

4. **MultiFile Subset** (40 instances): Multi-file problems solved by ≤10 agents
   - Targets real-world complexity requiring coordinated edits across files
   - Even leading agents reach only a 10% success rate
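
The trade-off between discrimination and statistical significance mentioned above can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses the top-agent rates quoted in the list, the 500-instance size of SWE-Bench Verified, and a plain binomial standard error, which assumes instances are independent trials (an assumption of this example, not something the leaderboard computes).

```python
import math


def resolve_rate_se(rate: float, n: int) -> float:
    """Binomial standard error of a resolve rate measured on n instances.

    Assumes each instance is an independent pass/fail trial (illustrative only).
    """
    return math.sqrt(rate * (1.0 - rate) / n)


# Top-agent rates quoted above; 500 is the size of SWE-Bench Verified.
full_rate, full_n = 0.732, 500
frontier_rate, frontier_n = 0.116, 95

# How many times lower the top score is on the Frontier subset (about 6.3).
score_ratio = full_rate / frontier_rate

# Smaller subsets give wider error bars on the measured rate.
se_full = resolve_rate_se(full_rate, full_n)
se_frontier = resolve_rate_se(frontier_rate, frontier_n)

print(f"score ratio: {score_ratio:.2f}")
print(f"standard error, full benchmark: {se_full:.4f}")
print(f"standard error, Frontier subset: {se_frontier:.4f}")
```

Under these assumptions the standard error is roughly 2.0 percentage points on the full benchmark and roughly 3.3 points on the Frontier subset, which is why a larger subset such as Challenging can be a reasonable middle ground.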

## 🔬 Methodology

Subsets were created through systematic analysis of the solve distribution across 83 evaluated SWE-agents:
- Problems solved by fewer agents provide better discrimination
- Analysis covers submissions from October 2023 to May 2025
- "Solved" means the agent's fix passed the verification test suite
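
The solve-count filtering described above can be sketched as follows. This is a minimal illustration, not the author's actual pipeline: the function names, the input layout (a mapping from agent name to the set of instance IDs it resolved), and the toy data are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def solve_counts(all_instances: Iterable[str],
                 resolved_by_agent: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count how many agents solved each instance (unsolved instances get 0)."""
    counts = {iid: 0 for iid in all_instances}
    for resolved in resolved_by_agent.values():
        for iid in resolved:
            if iid in counts:  # ignore IDs outside the benchmark
                counts[iid] += 1
    return counts


def select_subset(counts: Dict[str, int], max_solvers: int) -> List[str]:
    """Keep instances solved by at most `max_solvers` agents."""
    return sorted(iid for iid, c in counts.items() if c <= max_solvers)


# Toy data (hypothetical): four instances, three agents.
instances = ["a", "b", "c", "d"]
resolved_by_agent = {
    "agent_1": {"a", "b"},
    "agent_2": {"a"},
    "agent_3": {"a", "b", "c"},
}

counts = solve_counts(instances, resolved_by_agent)
rare = select_subset(counts, max_solvers=1)   # solved by at most one agent
wider = select_subset(counts, max_solvers=2)  # solved by at most two agents
print(counts, rare, wider)
```

With the real data, thresholds of 5 and 20 solving agents would correspond to the Frontier and Challenging definitions given earlier.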

## Key Insights

- **Enhanced Resolution**: The top score on the Frontier subset is about 6x lower than on the full benchmark (11.6% vs 73.2%), giving finer resolution between top systems
- **Multi-file Complexity**: Multi-file problems represent genuine software engineering challenges
- **Statistical Significance**: The Challenging subset offers robust evaluation with strong discrimination
- **Real Progress**: Gains on these subsets indicate genuine capability advances

## Resources

- **Blog Post**: [From 73% to 11%: Revealing True SWE-Agent Capabilities](https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets.html)
- **Dataset**: [SWE-bench_Verified-discriminative](https://huggingface.co/datasets/jatinganhotra/SWE-bench_Verified-discriminative)
- **Original SWE-Bench**: [SWE-bench.com](https://www.swebench.com/)

## Usage

```python
from datasets import load_dataset

# Load a specific discriminative subset by split name
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
challenging = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="challenging")
```
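
Once a subset is loaded, an agent's resolve rate on it can be computed from the set of instance IDs that agent resolved. The helper below is a hedged sketch: `subset_resolve_rate` is a hypothetical name, and it assumes each row exposes an `instance_id` field, as the upstream SWE-bench datasets do. It runs here on toy rows so that it works without downloading anything.

```python
from typing import Iterable, Mapping, Set


def subset_resolve_rate(rows: Iterable[Mapping[str, object]],
                        resolved_ids: Set[str]) -> float:
    """Fraction of subset instances whose ID appears in `resolved_ids`.

    `rows` can be any iterable of dict-like rows with an "instance_id" key
    (assumed field name), such as a loaded subset split.
    """
    ids = [row["instance_id"] for row in rows]
    if not ids:
        raise ValueError("subset is empty")
    solved = sum(1 for iid in ids if iid in resolved_ids)
    return solved / len(ids)


# Toy rows standing in for a loaded subset (hypothetical IDs).
toy_subset = [
    {"instance_id": "repo__proj-1"},
    {"instance_id": "repo__proj-2"},
    {"instance_id": "repo__proj-3"},
    {"instance_id": "repo__proj-4"},
]
toy_resolved = {"repo__proj-1", "repo__proj-9"}  # one ID is outside the subset

rate = subset_resolve_rate(toy_subset, toy_resolved)
print(f"resolve rate: {rate:.2%}")
```

Passing the `frontier` split loaded above in place of `toy_subset` should work the same way, provided the `instance_id` assumption holds for that dataset.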

Created by [Jatin Ganhotra](https://jatinganhotra.dev) | Last Updated: June 19, 2025