---
title: SWE-Bench Verified Discriminative Subsets Leaderboard
emoji: πŸ†
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: mit
tags:
- leaderboard
- software-engineering
- swe-bench
- evaluation
- benchmark
---

# πŸ† SWE-Bench Verified Discriminative Subsets Leaderboard

This interactive leaderboard displays SWE-agent performance across SWE-Bench Verified and four discriminative subsets designed to provide enhanced evaluation sensitivity for state-of-the-art systems.

## 🎯 Why Discriminative Subsets?

As SWE-agents improve, with top systems now achieving success rates above 70% on the full SWE-Bench Verified benchmark, traditional evaluation loses discriminative power. These targeted subsets focus on the most challenging problems to better distinguish between top-tier systems.

## πŸ“Š The Four Discriminative Subsets

1. **πŸ”₯ Frontier Subset** (95 instances): Problems solved by ≀5 agents - maximum evaluative sensitivity
   - Combines unsolved, ultra-rare, and very-rare problems
   - Top agent: 11.6% vs 73.2% on full benchmark (6x better discrimination)

2. **⚑ Challenging Subset** (155 instances): Problems solved by ≀20 agents - strong evaluative power  
   - Balances discrimination with statistical significance
   - Includes frontier + rare and uncommon problems

3. **πŸ’ͺ Hard Subset** (45 instances): All Hard difficulty problems regardless of solve rate
   - Traditional difficulty-based evaluation
   - Focuses on problems originally classified as most difficult

4. **πŸ“ MultiFile Subset** (40 instances): Multi-file problems solved by ≀10 agents
   - Targets real-world complexity requiring coordinated edits
   - Even leading agents achieve only 10% success rate

## πŸ”¬ Methodology

Subsets were created through systematic analysis of the solve distribution across 83 evaluated SWE-agents:
- Problems solved by fewer agents provide better discrimination
- Analysis covers submissions from October 2023 to May 2025
- "Solved" means the agent's fix passed the verification test suite
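The selection rule described above can be sketched as a simple threshold filter over per-instance solve counts. This is a minimal illustration, assuming a mapping from instance ID to the number of agents (out of the 83 evaluated) that solved it; the IDs and counts below are illustrative, not real data:

```python
# Illustrative solve counts: instance ID -> number of agents (of 83) that solved it.
solve_counts = {
    "inst-a": 0,   # unsolved by any agent
    "inst-b": 3,   # ultra-rare solve
    "inst-c": 12,  # rare solve
    "inst-d": 60,  # widely solved
}

def select_subset(counts, max_solvers):
    """Keep instances solved by at most `max_solvers` of the evaluated agents."""
    return sorted(iid for iid, n in counts.items() if n <= max_solvers)

frontier = select_subset(solve_counts, 5)      # problems solved by <=5 agents
challenging = select_subset(solve_counts, 20)  # problems solved by <=20 agents
print(frontier)
print(challenging)
```

Because the thresholds nest (≤5 implies ≤20), the Frontier subset is by construction contained in the Challenging subset.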

## πŸ“ˆ Key Insights

- **Enhanced Resolution**: Frontier subset provides 6x better discrimination between top systems
- **Multi-file Complexity**: Represents genuine software engineering challenges
- **Statistical Significance**: Challenging subset offers robust evaluation with strong discrimination
- **Real Progress**: Performance on these subsets indicates genuine capability advances

## πŸ”— Resources

- **Blog Post**: [From 73% to 11%: Revealing True SWE-Agent Capabilities](https://jatinganhotra.dev/blog/swe-agents/2025/06/05/swe-bench-verified-discriminative-subsets.html)
- **Dataset**: [SWE-bench_Verified-discriminative](https://huggingface.co/datasets/jatinganhotra/SWE-bench_Verified-discriminative)
- **Original SWE-Bench**: [SWE-bench.com](https://www.swebench.com/)

## πŸš€ Usage

```python
from datasets import load_dataset

# Load specific discriminative subset
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
challenging = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="challenging")
```
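An agent's score on a subset is simply the fraction of the subset's instances it resolved. A minimal sketch, using illustrative instance IDs in place of the subset's `instance_id` column and a real evaluation report:

```python
# Illustrative instance IDs; in practice, subset_ids comes from the loaded
# subset's `instance_id` column, and resolved_ids from an agent's
# SWE-bench evaluation report.
subset_ids = {"inst-a", "inst-b", "inst-c", "inst-d", "inst-e"}
resolved_ids = {"inst-b", "inst-x"}  # instances this agent's patches resolved

# Only resolutions of instances inside the subset count toward the score.
rate = 100 * len(subset_ids & resolved_ids) / len(subset_ids)
print(f"Subset resolution rate: {rate:.1f}%")
```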

Created by [Jatin Ganhotra](https://jatinganhotra.dev) | Last Updated: June 19, 2025